Thursday, 27 August 2009

Tony Hoare and the meaning of life (well, almost...)

You know, I was always wondering about programming - is it an art, a craft, or an engineering discipline? Some crazy hackers maintain it to be an art, while more down-to-earth types oscillate between craft and engineering.

My personal feeling was that it couldn't be engineering - I missed the scientific part of it, it was too "soft". To me, software engineering appeared rather to be a set of rules of thumb for organizing our mental work (functions, classes, modules...), something from the realm of cognitive science perhaps :-). For example, I couldn't take my multithreaded program and prove that it doesn't deadlock (more on that in a future, i.e. not yet written but planned, installment of this blog)...

1. The presentation

And then, some months ago, I came across a presentation* given at the QCon 2009 conference by Sir C.A.R. Hoare (aka Tony Hoare of Hoare logic, Quicksort, and CSP fame) about the relationship between programming and computer science.

Computer Science is about how programs run and how correct programs can be constructed*
By the way, I used to dislike the ubiquitous video presentations, which seem to be replacing old-fashioned articles as the programmer's information source of choice. Well, to put it frankly, I hated them! Instead of being able to quickly skim the essentials and then read the article if it interested me, I'm now forced to listen for hours to some uninspiring presentations, often with some really horrible accents to boot, only to discover that in the end only the title of the video was interesting!

But on the other hand, with video presentations I'm now able to hear people like Linus Torvalds** and Tony Hoare in person, and it proved to be very rewarding in both cases. Watching Tony Hoare was a great pleasure - he's a wonderfully gentle and good-humored old man, I'd say my idol of sorts. And as he got his first degree in philosophy, he's bound to have some interesting insights into the question at hand.

His answer is both simple and compelling (I'm rephrasing it here):
Software Engineering is an engineering discipline because it uses Computer Science, and Computer Science is a science because its results are exploited and confirmed by software products!*
Sounds at first rather convincing, doesn't it? But wait, for a philosophy major, isn't he overlooking something?

2. Circulus Vitiosus?

Well, embarrassingly, it looks suspiciously similar to a circular argument, right? SW Engineering gets justified by CS, but CS is justified by SW Engineering? It certainly appeared like this to me at first glance. So let us have a closer look at it:

science is: "the systematic study of the structure and behaviour of the physical world, especially by watching, measuring and doing experiments, and the development of theories to describe the results of these activities "***
Does this apply here? Well, maybe, if for "physical world" we substitute the human-built things like hardware and binaries. But wait, don't we rather work with mental constructs instead (you know, languages, functions, classes)? Well, yes, but they are models, like math is a model in civil engineering. In the end it's all about how the binary runs!

engineering is: "to design and build something using scientific principles"***
This one seems to fit perfectly. So both parts of the argument are correct, but is it a circular one? Let us translate it from English into the language of logic:
(CS usedIn SWEng => SWEng is Eng) and (CS usedIn SWEng => CS is Sc)
You see, it's not really circular, as these are two separate propositions, and they just rephrase the two definitions above (check it!). Science is defined as a special, rigorous kind of examination of the physical world, and engineering uses science when dealing with this world directly! But wait again, nobody said that we are allowed to use only science in engineering, or did they?
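Spelled out a bit more formally (my own shorthand, not Hoare's), the difference between a genuinely circular argument and the one above looks like this. The letters U, E and S are just labels I introduce here:

```latex
% U: computer science is used in software engineering (the shared premise)
% E: software engineering is an engineering discipline
% S: computer science is a science
\begin{align*}
\text{circular reading:}\quad   & (S \Rightarrow E) \land (E \Rightarrow S) \\
\text{reading used here:}\quad  & (U \Rightarrow E) \land (U \Rightarrow S)
\end{align*}
```

In the second form neither conclusion shows up as a premise of the other; both follow from the single observation U by modus ponens.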

3. The Engineer and the Scientist

This leads us to another interesting aspect of that presentation, which was the comparison between science and engineering, and how they relate to each other. It somehow explains the above relationship (I must paraphrase again):

so it's first:
scientists are interested in pure truths, i.e. programs which don't have errors, while an engineer must compromise, i.e. live with programs that have some errors,

and second:
scientists are interested in certainty, while an engineer lives in uncertainty, must take risks and perform risk management,

and thus:
scientists should be women, engineers should be men (Yes, he really made this little joke:)

There are several other comparisons along these lines in the presentation, and furthermore, even a gradation between the two extremes, like applied scientist or scientific engineer, is introduced. So there is (almost) a continuum of positions between the two extremes of the pure computer scientist and the common hacker.

And that fact harbors some kind of consolation, particularly if I'm in a very pedestrian project and must do the most boring things. Just remember, you can always move up the ladder, and use more science, more absolute truths, and thus (as the ancients believed) come in contact with more beauty. And that's maybe the ultimate "consolatio philosophiae", or do you think this is an exaggeration on my part?

--
* none of the citations are verbatim, they are all rephrased by yours truly, so if they are wrong it's my fault! Original source: Tony Hoare, "The Science of Computing and the Engineering of Software", QCon Conference, Jun 11, 2009, London: http://www.infoq.com/presentations/tony-hoare-computing-engineering

** for example this presentation by Linus was fun: http://www.youtube.com/watch?v=4XpnKHJAok8&feature=player_embedded

*** the definitions are taken from "Cambridge Advanced Learner's Dictionary", as I was looking for simplicity: http://dictionary.cambridge.org/

Tuesday, 28 July 2009

Two C++ curiosities with a deeper meaning

Lately, I was quite surprised by two things in the realm of programming, both of them C++ related. The first one is the new attribute syntax for C++09*, which I curiously failed to notice in the new standard proposal when I first had a look at it.

1. The first one


What are attributes? No, they aren't class data members with a convenient access syntax in this case (like Groovy's attribute syntax - or was it Python?).

I personally only knew attributes as an ugly Microsoft Windows COM-specific hack, by virtue of which you can inject COM related information into C++ code. Look at this example:

    #define _ATL_ATTRIBUTES 1
    #include <atlbase.h>
    #include <atlcom.h>
    #include <string.h>
    #include <comdef.h>

    [module(name="test")];

    [ object, uuid("00000000-0000-0000-0000-000000000001"), library_block ]
    __interface IFace
    {
        [ id(0) ] int int_data;
        [ id(5) ] BSTR bstr_data;
    };

    [ coclass, uuid("00000000-0000-0000-0000-000000000002") ]
    class MyClass : public IFace
    {
    private:
        int m_i;
        BSTR m_bstr;

    public:
        MyClass();
        ~MyClass();

        int get_int_data();
        void put_int_data(int _i);
        BSTR get_bstr_data();
        void put_bstr_data(BSTR bstr);
    };
Doesn't it send shivers down your spine? You might say I'm not consistent, because I just lashed out at excessive XML configuration files (here), but when presented with the alternative, I don't like it either! And now there are annotations in Java too, doing a similar thing... What can I say, I grew up with the classical paradigm of the standalone program using external functionality through libraries. But nowadays you don't program just for the machine or the OS, you program for some giant framework! Did you hear the phrase "programming by bulldozer"? And to my mind there's no satisfactory model of how to do this! For me, the configuration file madness and the mixing of code with meta information are both somehow unsatisfactory. Look at this example JBoss Seam action class:
    @Stateless
    @Name("manager")
    public class ManagerAction implements Manager {
        @In @Out private Person person;
        @Out private List<Person> fans;
    }
Well, what do you think, does it look OK? We have (nearly) POJOs here (i.e. no framework class derivation) which is the Holy Grail in Java programming lately, but the code is littered with annotations. To me they are no better than the old C++ event macros used in some very old Windows frameworks!
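To make the "metadata in code" idea a bit more concrete, here is a minimal sketch of what the framework side does with such annotations: it reads them back through reflection at run time. Everything in it is made up for illustration - the `@Wire` annotation, `ManagerBean` and `AnnotationScanner` are my own names and have nothing to do with Seam's real `@In`/`@Out` processing.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical marker annotation; RUNTIME retention keeps it visible to reflection.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.FIELD)
@interface Wire {
    String value() default "";
}

// A (nearly) plain object: it carries metadata but derives from no framework class.
class ManagerBean {
    @Wire("person") private String person;
    @Wire private String fans;
    private String untouched;
}

// Stand-in for the framework: finds the fields that ask to be wired.
class AnnotationScanner {
    static List<String> injectionPoints(Class<?> type) {
        List<String> names = new ArrayList<>();
        for (Field field : type.getDeclaredFields()) {
            Wire wire = field.getAnnotation(Wire.class);
            if (wire != null) {
                // Fall back to the field name when no explicit name was given.
                names.add(wire.value().isEmpty() ? field.getName() : wire.value());
            }
        }
        // getDeclaredFields() promises no particular order, so sort for stable output.
        Collections.sort(names);
        return names;
    }

    public static void main(String[] args) {
        System.out.println(injectionPoints(ManagerBean.class));
    }
}
```

The point of the sketch is only that the annotated class itself contains no behaviour for the annotations; all of it lives in some processor elsewhere, which is exactly what makes the code look "littered" when you read it on its own.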

Now the question arises: is that a cognitive bias caused by the preceding simple (and thus elegant) programming model? Remember, even Herb Sutter, when charged with the task of making .NET programming possible with C++, didn't embrace the traditional program + library model! He rather created a language extension, the C++/CLI language dialect. Maybe he was constrained to do so by his commission of getting Managed C++ up to speed, but on the other hand, in the design paper he argued that it's the natural way and compliant with the spirit of C++. Personally I didn't like this language extension a bit, and would have opted for a carefully designed library or framework oriented solution, but maybe that's simply not possible???

Sorry, I digressed. So back to the subject!

The second species of attributes I knew were the GNU attributes, which look like this:
    void kill(int) __attribute__((__noreturn__)); //GCC specific
but I never used them either, because I thought they were only of any use with some GNU-specific language extensions, which I always avoid like the plague. And do you know what the new standard-compliant attributes look like now? Look ("C++09 Working Draft n2798", 10.2008, Chap. 7.6.):
    class C
    {
        virtual void f [[final]] ();
    };

It's a crossbreed of the Microsoft and GNU annotations, isn't it? That's why everyone is happy, and why the syntax is really ugly. And do you know what? There is a handful of annotations defined in the standard (but vendors are allowed to add new ones) and one of them is, you guessed it, final! Could someone tell me why this one should be an annotation and not a keyword - is it supposed to matter only in specific environments? Or are we clearing the way for vendor-specific language extensions? So maybe the whole annotation thing is in there to placate the major vendors: Microsoft and GNU? Because as I understand it (please correct me if I'm wrong!) there isn't a mechanism to plug in an attribute processor as with Microsoft's COM attributes or Java's annotations, so the whole construct is aimed at the compiler writers.

Summing up: first, the Java annotation syntax, as seen in the Seam example above, seems to be more pleasing to the eye, and second, I didn't really grasp the need for standardized annotations, except for the vendors' convenience.

2. The second one

The second curious thing I stumbled upon is the removal of Concepts (or should I say of the "Concepts" concept?) from the C++09 standard proposal. As I read**, it was the standard library where the Concepts were first removed from. Come again? When I had a look at the draft, all the std library chapters were strewn with concepts, every one of them in pretty blue (or was it green?) colour.

And weren't Concepts supposed to be the best thing since sliced bread? I mean not for me, as I had a look at them once and didn't want to learn them (in particular the whole concept_map stuff!!!), but for the compiler writers - they could at last avoid the horrendous error messages for type errors in template instantiations. Besides that, it was Bjarne's pet project, and I thought that all of Bjarne's PhD and post-doc students were working on it!

And now, out of the blue, the whole feature was cancelled! So maybe there's a limit to how far the type system can be pushed? C++ is a relentless type-monster already, and sometimes programming in it feels like writing a mathematical proof to me. Thus building another level of type definitions on top of the generic meta types is maybe too much of a good thing. But, OK, the idea is not to take the "duck typing" out of templates altogether, but to strike a delicate balance: add a minimum of concept declarations and get the best error messages and type safety possible. Is that task simply too complicated and too big, or is it only the current design of the mechanism which is flawed?

As Bjarne said in the paper which finally dealt the death blow to the Concepts***:

Concepts were meant to make generic programming easier as well as safer. It is part of a whole collection of features aimed at simplifying GP, together with template aliases, auto, decltype, lambdas, etc. However, "concepts" is a complex mechanism and its language-technical complexity seems to be leaking into user code. By "language-technical complexity" I mean complexity arising from the need of compiler/linker technology rather than complexity from the solution to a problem itself (the algorithm).

My particular concern is that in the case of concept maps, in the name of safety we have made templates harder to use. We require programmers to name many entities that would better be left unnamed, cause excess rigidity in code and encourage a mindset in programmers that will lead to either a decrease in generic programming (in favor of less appropriate techniques) or to concepts not being used (where they would be useful). We have overreacted to the problems of structural typing. ***
This means that exactly what I wasn't willing to learn leaked out into the user space! Bjarne made a series of proposals for how to avoid it, but my general feeling after having read the paper was that the design of concepts is somehow flawed, and cannot be fixed by some quick workarounds. So it's no wonder that the committee had to make a tough decision. And that after years of work! And nobody saw it coming in all these years? Don't you think it's uncanny how little we can do to design a complicated language feature and avoid mistakes? Or maybe we see here how badly the standardisation process is working. These problems had been there for a much longer time and nobody said a word! Because of lack of time, because of politics, or simply because no one really understood the topic? So it had to be the creator of C++ himself who blew the whistle:

The use of concepts is supposed to help people write and use a wide range of templates. The current definition of concept maps and the philosophy that seems to go with them makes it harder.

Addressing this is important. I suspect that the alternative is widespread disuse of concepts and libraries using concepts. I would consider that a major failure of C++0x. ***

I'd say he killed his pet project out of a sense of responsibility.

And where is the promised deeper meaning? As I said previously: there's maybe a limit to the type system. Because to me it sometimes felt like constructing inaccessible cardinals on top of the regular ones (wiki) in set theory - just very, very subtle. And maybe the horrendous error messages for templates are just the proverbial price of (type) freedom, and we had better pay it? When you read Bjarne's article*** you see that the greatest problems are encountered in subclassing and conversions - so C++ is a much more "modern" language than most of us would like to believe, as it allows us so much type freedom, using the type system itself to achieve it!

And maybe the design problem is relatively simple to solve, but the archaic linker technology we inherited from C is simply too old for that? As it reads:

"... complexity arising from the need of compiler/linker technology rather than complexity from the solution to a problem itself (the algorithm)"***
---
* Danny Kalev "Introducing Attributes", http://www.informit.com/guides/content.aspx?g=cplusplus&seqNum=440
** Danny Kalev "The Removal of Concepts From C++0x", http://www.informit.com/guides/content.aspx?g=cplusplus&seqNum=441
*** Bjarne Stroustrup "Simplifying the use of concepts", 2009-06-21, http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2009/n2906.pdf

Saturday, 27 June 2009

For German Speakers only...

Hi everyone! As it happens, I recently stumbled upon an interesting C++ website, which I quite enjoyed, and I even happened to learn something new in the process! Its address is:

http://www.kharchi.eu/wiki/doku.php?id=cpp:start

I particularly liked the coverage of some more modern topics like my favourite theme of Web programming in C++, embedded C++ scripting, or the string tutorial covering conversions between different encodings (and not to forget the link to the Pirate Party :)). Don't get me wrong, it's not the uber C++ webpage, but it's a cute, interesting little thing!

Unfortunately it's all in German :-((. Good for me (see, learning languages pays off sometimes), bad luck for the rest of you. So if you speak German, have a look at it!

Speaking of languages, here's a little story as a goodie: on some project several years ago I had to update an open source tool for copying entire websites. It was thousands and thousands of lines of plain, spaghetti-style C code. As if this weren't bad enough, all the comments that were there weren't in English, no, they were all in French!!! Had I not learnt some French for my holidays, I couldn't possibly have finished this assignment! Without comments the code was absolutely incomprehensible! However, it worked fine, even if it wasn't any OO or structured programming thing. Why? For all we know it shouldn't have ;-))).

Thursday, 28 May 2009

Windows not so bad after all?

This post is for my distinguished hacker colleague Stefan Z. Do you remember how we were always complaining that you cannot do any serious development on Windows, as the following standard Unix command line isn't possible?
    cat out.txt | grep ERROR
Well, as I'm doing Windows programming now, I recently found out that it's possible after all! Incredible but true! However, the syntax is somewhat different:
    type out.txt | findstr ERROR
Things like that show me how fast time is passing! A couple of years ago I'd never have believed that Windows would ever catch up. I think the tide turned with XP.

But wait, it gets even better! Here's a script for checking if a process is running and killing it if so:

    tasklist /FI "IMAGENAME eq myProcess.exe" | findstr myProcess.exe
    if ERRORLEVEL 1 GOTO next

    echo "-- kill running myProcess"
    taskkill /F /IM myProcess.exe

    :next
    echo "-- starting the XXX process"
Even a sleep is possible (although it's arguably very ugly):
    @ping 127.0.0.1 -n 2 -w 1000 > nul

And here (in the IT-Dojo) you can see that the Windows command prompt has some interesting history mechanisms!

Tuesday, 19 May 2009

Two Java Collection Utilities

Lately I stumbled upon two interesting Java libraries which add search capabilities to Java collections: one adding SQL search and another adding XPath search syntax to an existing collection. I think it's cool - you can take a collection of Java objects and treat it in a high-level way! I suppose they are implemented using introspection.

The first one is called JoSQL (http://josql.sourceforge.net/index.html) and can be used like this:
    try {
        Query q = new Query ();
        q.parse(
            "SELECT * " +
            "FROM java.lang.String " +
            "WHERE length <= :stringSize " +
            "AND toString LIKE '%e%' " +
            "ORDER BY length DESC");

        q.setVariable ("stringSize", "5");
        List<String> names = new ArrayList<String>();
        String[] n = {
            "alpha", "beta", "gamma", "delta", "epsilon", "zeta", "pi", "chi"
        };
        Collections.addAll(names, n);

        QueryResults qres = q.execute(names);
        List res = qres.getResults();
    }
    catch(Exception e) ...
Here I got { "beta", "delta", "epsilon" } as result. The only problem I can see so far is that the library doesn't use generics. The other one is the somewhat unintuitive way it treats the "SELECT *" query. Normally the JoSQL library treats a collection of Java objects as a table whose columns are formed by the values returned by the class's methods. However, in the case of a "SELECT *" query, we won't get all of the object's attributes, but just the whole object as it is. On the other hand, it's the equivalent, isn't it?
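For comparison, here is a rough sketch of what the same selection costs when written by hand in plain Java. It only follows a literal reading of the WHERE and ORDER BY clauses (length at most 5, contains an 'e', longest first) with an integer limit; it says nothing about how JoSQL itself evaluates the query or binds the "5" string variable. `PlainJavaQuery` and `select` are names I made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class PlainJavaQuery {
    // Keeps the strings that are no longer than maxLength and contain an 'e', longest first.
    static List<String> select(List<String> names, int maxLength) {
        List<String> result = new ArrayList<>();
        for (String name : names) {
            if (name.length() <= maxLength && name.contains("e")) {
                result.add(name);
            }
        }
        // List.sort is stable, so equally long strings keep their input order.
        result.sort(Comparator.comparingInt(String::length).reversed());
        return result;
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList(
                "alpha", "beta", "gamma", "delta", "epsilon", "zeta", "pi", "chi");
        System.out.println(select(names, 5)); // prints [delta, beta, zeta]
    }
}
```

It isn't much longer than the query, but the filter and the ordering are now frozen in code, whereas the SQL string could just as well come from a config file or from the user.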

The second one is called JXPath (http://commons.apache.org/jxpath/), and is part of the Apache Commons project. You can use it like this:
    JXPathContext context = JXPathContext.newContext(employees);

    Employee emp =
        (Employee)context.getValue("/employees[name='Susan' and age=27]");
Pretty nifty, isn't it? But while the SQL query syntax seemed obvious to me, the XPath search expressions seem plain ugly and non-intuitive. I don't know either SQL or XPath very well, but I find myself always forgetting pretty much everything about XPath while SQL sticks. Maybe because XPath is too low-level and too much of a hack?

You can read in many blogs that the above libraries are pretty much the same, but in reality, the XPath query syntax is better suited to hierarchical data, somewhat like this:
    Employee emp =
        (Employee)context.getValue("/departments/employees[name='Johnny']");
I assume you can do the equivalent in JoSQL, but you'd have to use joins, and that isn't very convenient for simple hierarchical structures like the one above. So, in the end, maybe XPath isn't so bad after all?
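If you want to play with hierarchical path expressions without pulling in JXPath, the JDK's own javax.xml.xpath package is enough for a small sketch. Note the swap: this runs XPath over an XML document, not over a graph of Java beans as JXPath does, and the sample data as well as the names `HierarchicalXPath` and `query` are invented for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class HierarchicalXPath {
    // Made-up sample data: two departments with their employees.
    static final String XML =
            "<company>"
          + "<department name='dev'>"
          + "<employee name='Susan' age='27'/>"
          + "<employee name='Johnny' age='31'/>"
          + "</department>"
          + "<department name='sales'>"
          + "<employee name='Mary' age='45'/>"
          + "</department>"
          + "</company>";

    // Evaluates an XPath expression against the sample document and returns its string value.
    static String query(String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(XML)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            return xpath.evaluate(expression, doc);
        } catch (Exception e) {
            // Parser and XPath errors are checked exceptions; wrap them for brevity.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // One path expression walks two levels down, no join needed.
        System.out.println(query("/company/department/employee[@name='Johnny']/@age"));
        // Asking "which department does Mary belong to?" is just as short.
        System.out.println(query("/company/department[employee/@name='Mary']/@name"));
    }
}
```

The second query is the kind of thing that would need a join between a departments table and an employees table in SQL; in a path language the parent-child relation is simply part of the syntax.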

Wednesday, 29 April 2009

From DLL- to XML-hell

This is one of the original articles I wanted to write when I started this blog. Well, it's taken some time, it's a little outdated now, but at last I've kept my promise (or rather one of them)!

1. The Lament


When I switched from C++ to Java in some previous project I thought I had plunged directly from the DLL- into the XML-hell! Well, not exactly (as I didn't program under Windows in the preceding project) but a little hyperbole makes a good start! The XML part of the title, however, is true. The very first thing I noticed was the ubiquitous XML - you had to use it for compilation (Ant scripts), you had to use it for deployment (web.xml), you had to use it for your application (struts2.xml, spring.xml, log4j.xml).

The most annoying part of this was the change from makefiles to Ant scripts: so much superfluous verbiage! Look, compare the Ant script with the equivalent Ant-builder script in Groovy, taken from the "Groovy in Action" book, first the XML file:

    <project name="prepareBookDirs" default="copy">
        <property name="target.dir" value="target"/>
        <property name="chapters.dir" value="chapters"/>
        <target name="copy">
            <delete dir="${target.dir}" />
            <copy todir="${target.dir}">
                <fileset dir="${chapters.dir}"
                         includes="*.doc"
                         excludes="~*" />
            </copy>
        </target>
    </project>
Then the builder script:
    TARGET_DIR = 'target'
    CHAPTERS_DIR = 'chapters'
    ant = new AntBuilder()
    ant.delete(dir:TARGET_DIR)
    ant.copy(todir:TARGET_DIR){
        fileset(dir:CHAPTERS_DIR, includes:'*.doc', excludes:'~*')
    }

Well, that's at least readable (and writeable), and it states exactly what you want! When I see such a comparison I think that the XML code should be used only as an Ant-internal, assembly-code level data representation! You might find it interesting that even the creator of Ant himself admitted that using XML was "probably an error"*.
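Just to complete the picture, here is roughly what the same delete-and-copy job looks like without Ant at all, in plain Java with java.nio.file. It's a sketch under my own assumptions: `PrepareBookDirs` and `copyChapters` are invented names, and I read `includes="*.doc"` and `excludes="~*"` as "files directly in the chapters directory ending in .doc, except those starting with a tilde".

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Comparator;
import java.util.stream.Stream;

class PrepareBookDirs {
    // Deletes targetDir, then copies chaptersDir/*.doc into it, skipping "~*" files.
    // Returns the number of files copied.
    static int copyChapters(Path chaptersDir, Path targetDir) {
        try {
            if (Files.exists(targetDir)) {
                // Reverse order visits children before their parent directories.
                try (Stream<Path> walk = Files.walk(targetDir)) {
                    walk.sorted(Comparator.reverseOrder()).forEach(p -> {
                        try {
                            Files.delete(p);
                        } catch (IOException e) {
                            throw new UncheckedIOException(e);
                        }
                    });
                }
            }
            Files.createDirectories(targetDir);
            int copied = 0;
            try (DirectoryStream<Path> docs = Files.newDirectoryStream(chaptersDir, "*.doc")) {
                for (Path doc : docs) {
                    if (doc.getFileName().toString().startsWith("~")) {
                        continue; // editor temp file, excluded like "~*" in the fileset
                    }
                    Files.copy(doc, targetDir.resolve(doc.getFileName()));
                    copied++;
                }
            }
            return copied;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path chapters = Paths.get("chapters");
        if (Files.isDirectory(chapters)) {
            System.out.println(copyChapters(chapters, Paths.get("target")) + " file(s) copied");
        } else {
            System.out.println("no 'chapters' directory here, nothing to do");
        }
    }
}
```

It says the same thing as the seven Groovy lines, only with the exception plumbing and the recursive delete spelled out by hand, which is probably why nobody writes build scripts this way.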

The second problem was that (as another programmer put it):

Developers from other languages often find Java's over-reliance on XML configuration annoying. We use so much configuration outside of the language because configuration in Java is painful and tedious. We do configuration in XML rather than properties because...well, because overuse of XML in Java is a fad.**

I can only confirm this. You can get a feel for how much XML "code" was needed in a Java web application I was then writing from the following (admittedly pretty old) table***:

    Metric                      Java + Spring/Hibernate            Ruby + Rails
    Time to market              4 months, approx. 20 hours/week    4 nights, 5 hours/night
    Lines of code               3,293                              1,164
    Lines of config             1,161                              113
    Number of classes/methods   62/549                             55/126

Do you see? The configuration is about a third the size of the code (1,161 lines against 3,293 !!!), and that's 10 times more than was needed in Rails! That's some dependency! By the way, note that Rails needs only about 3 times less code than Java, despite the fact that Java libraries can sometimes be pretty low level!

Now, for me the Spring 2 platform was somehow the apogee of XML overdependence - the endless configuration files weren't even human readable, you'd better use a graphical Eclipse plugin to edit the wirings. It's as if you needed something like UML to tame the complexity of the XML files (and of Spring's configuration, of course).

A rhetorical question: isn't it too much of a good thing for something as simple as configuration files? I found Spring 2 MVC (Spring's web GUI application framework) particularly bad in that respect. Comparing it to the Struts 2 configuration, I had the impression that every obvious fact must be expressed in XML, and that there are no defaults at all:

    <bean name="userListController"
          class="de.ib-krajewski.myApp.UserListController">
        <property name="userListManager">
            <ref bean="userListManager" />
        </property>
        ....
    </bean>
    <bean id="userListManager"
          class="de.ib-krajewski.myApp.UserListManager" />

and so on, and on, and on.

But wait, it gets even better: some people are so enamoured with XML that they are using it as a programming language****, either doing patterns with Spring and XML configuration files, or even inventing an XML-based programming language and writing its interpreter in XSLT (I'm not joking!).

A word of caution: the Java community has noticed the problem and is trying to reduce the amount of XML needed. Unfortunately, I haven't got any hard numbers, but each new release of a Java framework seems to claim a "reduced XML config file size" - take for example Spring MVC v.3. However, just as the DLL hell is still a reality despite Microsoft's efforts and the introduction of manifests (really, look here!), to my mind the overdependence on XML won't vanish overnight from the Java world - it's at its core!

2. The Analysis

2.1 XML data

Why are we all using XML? The received standard answer is: because it's a portable, human-readable, standardized data format with a very good tool support!

So why does it all feel so wrong? The first reason I see is that XML is too low level for human consumption. It claims to be human-readable while it's only human-tolerable - see the Groovy AntBuilder example above. There are tons and tons of verbiage (although that's something a Java programmer will be rather accustomed to ;-)), which is what machines need, but it's not how the human mind works. It needs high-level descriptions because it's only too easily distracted by unneeded details.

When I recall my own config file implementations of yore, I conclude that I never needed more constructs than single config values and lists of them. And I never needed hierarchical config data for my systems. So if people are using XML for configuration, they are maybe after something more than pure config settings, maybe they are looking for a scripting solution for the application (as mentioned in*) along the lines of the late Tcl? But with Tcl the model was entirely different: the Tcl scripts were gluing together passive modules of code, whereas now the modules are active and use XML as a database of sorts... So is that it at last? People using XML data as a poor man's read-only database? It's a good, ad-hoc solution, and you can emulate hierarchical, network and OO databases without much of a hassle, can't you? So people are using XML as a kind of quick and dirty hack, and hacks aren't normally very pretty ;-).
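To show how little machinery "single values and lists of them" really needs, here is a small sketch on top of the JDK's java.util.Properties. The keys, the comma as list separator and the class name `FlatConfig` are my own choices for the example, not something taken from a real project.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

class FlatConfig {
    private final Properties props = new Properties();

    // Parses "key = value" lines; the list separator (comma) is an assumption of this sketch.
    FlatConfig(String text) {
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // A single config value, or null if the key is missing.
    String value(String key) {
        return props.getProperty(key);
    }

    // A list of values; a missing key yields an empty list.
    List<String> list(String key) {
        List<String> items = new ArrayList<>();
        for (String item : props.getProperty(key, "").split(",")) {
            if (!item.trim().isEmpty()) {
                items.add(item.trim());
            }
        }
        return items;
    }

    public static void main(String[] args) {
        FlatConfig cfg = new FlatConfig(
                "log.level = INFO\n"
              + "servers = alpha, beta, gamma\n");
        System.out.println(cfg.value("log.level")); // prints INFO
        System.out.println(cfg.list("servers"));    // prints [alpha, beta, gamma]
    }
}
```

Two lines of config, no angle brackets, and the format is still one every Java tool understands.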

2.2 XML in Java

The second part of the question is: why are we using so much XML in Java? The answer is probably that we need XML config files to increase modularity and achieve loose coupling in our systems. The Spring framework allows us to pull the parts of the system together at startup or at run-time, and that's supposedly a good thing, isn't it?

Coming to this part of the question: I think that the loose coupling thing is overrated. As Nicolai Josuttis said in his book about SOA (we reported ;-)):

"...loose coupling has its price, and it's the complexity."

So firstly, a little bit of coupling isn't that bad if it decreases complexity and increases readability of code! To my mind, it's a classical tradeoff, but as some developers are only too prone to introduce tight coupling all over the place, the received wisdom is to avoid coupling at any price! But be honest, you can't decouple everything if the parts are meant to work together! Secondly, I think that the "inversion of control" pattern is a bit of overkill too.

Personally, I'd like to glue together the parts of the system directly in code, so that I can see how my software is composed, and that in my source code - it should be the sole point of reference! Thus I much prefer the approach Guice or Seam are taking (and nowadays the Spring framework itself) - that of annotations. At last we have the relevant information where it belongs - in code! The problem here is that the annotations aren't really code, they are metadata, and I don't like the idea of using metadata where you could do things directly by using some API. But hey, that's probably the way we are doing such things in Java...
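What I mean by gluing the parts together directly in code is something like the following sketch. The class names mirror the two bean definitions from the Spring XML above, but the method bodies, the `Application` class and the sample user names are invented for the example.

```java
import java.util.Arrays;
import java.util.List;

// Names mirror the XML bean definitions; the bodies are made up for illustration.
class UserListManager {
    List<String> users() {
        return Arrays.asList("alice", "bob");
    }
}

class UserListController {
    private final UserListManager userListManager;

    // The dependency still comes in from outside, so a test can hand in a stub.
    UserListController(UserListManager userListManager) {
        this.userListManager = userListManager;
    }

    String render() {
        return String.join(", ", userListManager.users());
    }
}

class Application {
    // The whole object graph is visible in one place and checked by the compiler.
    static UserListController wire() {
        UserListManager manager = new UserListManager();
        return new UserListController(manager);
    }

    public static void main(String[] args) {
        System.out.println(wire().render()); // prints alice, bob
    }
}
```

The coupling is exactly as loose as in the XML version (the controller only sees what is passed into its constructor), but a misspelled class name is now a compile error instead of a surprise at startup.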

3. A Summary

That's what I wanted to write approximately 2 years ago. Meanwhile I use XML everywhere in my architectures and designs, mainly because it excludes any discussions about formats! No one has ever said anything against XML!

But programming it (in C++ and Qt) is a real pain - I first tried DOM, then SAX - and it was an even greater mess (well, there's a C++ data binding implementation à la Java's Apache XMLBeans but it's not used at my client's :-((). XPath selectors are supposed to be better, but Qt doesn't have them yet. The good thing is, everyone will think the system is sooo cool because it uses XML, and I have to make my users happy!

PS: And you know what? The new dm application server by SpringSource uses JSON for configuration files!!! I told you so ;-)

--
* "The creator of Ant exorcising his own demons", this article was originally stored under http://x180.net/Articles/Java/AntAndXML.html, but this link seems to be dead, so I'll host it on my website for a while (hope it's OK...): http://www.ib-krajewski.de/misc/ant-retrospect.html

** as a matter of fact, he's a Ruby programmer too, so he isn't that impartial: http://www.brainbell.com/tutorials/java/About_Ruby.htm

*** taken from the book "Beyond Java" by Bruce Tate: http://commons.oreilly.com/wiki/index.php/Beyond_Java/Ruby_on_Rails, unfortunately the original link stated there isn't working: Justin Gehtland, Weblogs for Relevance, LLC (April 2005); http://www.relevancellc.com/blogs. "I *heart* rails; Some Numbers at Last."

**** "Wiring The Observer Pattern with Spring": http://www.theserverside.com/tt/articles/content/SpringLoadedObserverPattern/article.html, and "XSL Transformations. A delivery medium for executable content over the Internet" in DDJ from April 05, 2007: http://www.ddj.com/architect/198800555 . I cite from the latter:

XIM is an XML-based programming language with imperative control features, such as assignment and loops, and an interpreter written in XSLT. XIM programs can thus be packaged with an XIM processor (as an XSLT stylesheet) and sent to the client for execution.
See, I wasn't joking...

Saturday, 4 April 2009

Some distributed computing news: cuddly elephants, pigs, nymphs and couch computing

As I did some distributed computing research in my previous life, and then tried to get involved with Google, I'm still somewhat interested in the subject. If you are into distributed computing, you'll know that Google is using a distributed C++ system called MapReduce* for indexing the crawled webpages. It so happens that there's an open-source Java implementation of Google's system (based on Google's published research papers) called Hadoop. Up to now I didn't know about any serious usage of Hadoop in big distributed systems, but as it turns out, Yahoo! is using Hadoop internally** for spam detection!

Yahoo's researchers have also designed a new language called Pig Latin (?!)*** for distributed programming on the Hadoop platform. The idea is rather an interesting one: they pointed out that the existing way of writing programs for the mapper and reducer parts of a distributed application is too low-level and not reusable, although its step-by-step style is familiar to programmers. A declarative language like SQL, in contrast, is high-level, but its paradigm is completely different and poses problems of its own. Pig Latin tries to combine the two:

At the same time, the transformations carried out in each step are fairly high-level, e.g., filtering, grouping, and aggregation, much like in SQL. The use of such high-level primitives renders low-level manipulations (as required in map-reduce) unnecessary.
...

To experienced system programmers, this method is much more appealing than encoding their task as an SQL query, and then coercing the system to choose the desired plan through optimizer hints.
So it provides a middle ground, where we can avoid SQL and write down the computation steps, but can do that at higher level!
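To get a feeling for that middle ground, here is a small single-machine sketch in plain Java. It is neither Pig Latin nor Hadoop: Java streams merely stand in for the "write down the steps, but at a higher level" style, and a hand-written loop stands in for the map, shuffle and reduce phases. The `Page` record, the sample data and all method names are invented.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

class StepwisePipeline {
    // Made-up record type: a crawled page with a category and a rank.
    static final class Page {
        final String url;
        final String category;
        final double rank;

        Page(String url, String category, double rank) {
            this.url = url;
            this.category = category;
            this.rank = rank;
        }
    }

    static List<Page> sample() {
        return Arrays.asList(
                new Page("a", "news", 0.5), new Page("b", "news", 0.75),
                new Page("c", "news", 0.125), new Page("d", "sports", 0.25),
                new Page("e", "sports", 0.125));
    }

    // High-level style: each step is named, but nobody writes the plumbing.
    static Map<String, Double> highLevel(List<Page> pages, double minRank) {
        return pages.stream()
                .filter((Page p) -> p.rank > minRank)                          // step 1: filter
                .collect(Collectors.groupingBy((Page p) -> p.category,         // step 2: group
                        TreeMap::new,
                        Collectors.averagingDouble((Page p) -> p.rank)));      // step 3: aggregate
    }

    // Low-level style: the same job as explicit map, shuffle and reduce phases.
    static Map<String, Double> mapReduceStyle(List<Page> pages, double minRank) {
        // "map" + "shuffle": emit (category, rank) pairs and collect them per key
        Map<String, List<Double>> shuffled = new TreeMap<>();
        for (Page p : pages) {
            if (p.rank > minRank) {
                shuffled.computeIfAbsent(p.category, k -> new ArrayList<>()).add(p.rank);
            }
        }
        // "reduce": collapse each key's values into one average
        Map<String, Double> result = new TreeMap<>();
        for (Map.Entry<String, List<Double>> e : shuffled.entrySet()) {
            double sum = 0;
            for (double r : e.getValue()) {
                sum += r;
            }
            result.put(e.getKey(), sum / e.getValue().size());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(highLevel(sample(), 0.2));      // prints {news=0.625, sports=0.25}
        System.out.println(mapReduceStyle(sample(), 0.2)); // prints {news=0.625, sports=0.25}
    }
}
```

Both methods compute the same averages; the difference is only in how much of the bookkeeping you have to write and maintain yourself, which is the point the Pig Latin authors are making.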

Unsurprisingly, Microsoft already has a proprietary distributed computing platform called Dryad, which tries to do something like this as well, but uses C# and LINQ (i.e. an SQL-like query language):

DryadLINQ generates Dryad computations from the LINQ Language-Integrated Query extensions to C#.

But I don't know how much of a research project this is and whether it's really used internally. On the other hand, Microsoft recently acquired a search company called Powerset, which is using Hadoop at its core and which, as the gossip goes, should be used to bring Microsoft's LiveSearch up to speed.

Facebook, in turn, is using Hadoop, but added a Business Intelligence tool called Hive (now released as open source!) on top of it. It uses an SQL-like query language, as BI users understand it best. Beneath the surface Hive leverages Hadoop and translates SQL-like statements into MapReduce jobs.


Not bad, cuddly elephant! It seems everyone is loving you!


Incidentally, speaking of SQL and non-SQL: there's a non-SQL database by Apache called CouchDB. It is written in Erlang (we reported already**** ;-)) and implements a distributed document database. It's designed for lock-free concurrency using Erlang's "share-nothing" philosophy. To my mind it resembles Linus Torvalds' git source control system, as it holds all the versions of a document and determines the "winner" that is visible in the defined views. Hear, hear! Erlang's alive, distributing and kicking!

Another thing worth noticing is that it's not only SQL anymore: there are attempts to do things differently, although others are sticking with the old and reliable. A paradigm change on its way? These are interesting times, I think.

PS (6 Oct. 2009): Look, look, now even Big Blue has jumped on the bandwagon! They offer the M2 enterprise data analysis platform, based on Hadoop and using Pig. See: http://www.sdtimes.com/link/33808. So it seems Hadoop is definitely a mainstream, business analyst thingie now, and as such it has lost much of its appeal, as far as I'm concerned.

---
* Jeffrey Dean, Sanjay Ghemawat: "MapReduce: Simplified Data Processing on Large Clusters", OSDI 2004, http://labs.google.com/papers/mapreduce.html
** unfortunately it's a two-parts video: http://developer.yahoo.net/blogs/hadoop/2009/03/using_hadoop_to_fight_spam_-_part_1.html
*** Christopher Olston et al: "Pig Latin: A Not-So-Foreign Language for Data Processing", SIGMOD 2008, http://www.cs.cmu.edu/~olston/publications/sigmod08.pdf
**** http://ib-krajewski.blogspot.com/2007/08/erlangs-change-of-fortunes.html