Wednesday, 29 April 2009

From DLL- to XML-hell

This is one of the original articles I wanted to write when I started this blog. Well, it's taken some time, it's a little outdated now, but at last I've kept my promise (or rather one of them) !

1. The Lament

As I switched from C++ to Java in some previous project I thought I plunged directly from the DLL- to the XML-hell! Well, not exacltly (as I didn't programm under Windows in the preceeding project) but a little hyperbole makes a good start! The XML part of the title however, is true. The very first thing I noticed, was the ubiquitous XML - you had to use it for compilation (Ant scripts), you had to use it for for deployment (web.xml), you had to use it for your application (struts2.xml, spring.xml, log4j.xml).

The most annoying part of this was the change from makefiles to Ant scripts: so much superfluous verbiage! Look, compare the Ant script with the equivalent Ant-builder script in Groovy, taken from the "Groovy in Action" book, first the XML file:

<project name="prepareBookDirs" default="copy">
<property name="target.dir" value="target"/>
<property name="chapters.dir" value="chapters"/>
<target name="copy">
<delete dir="${target.dir}" />
<copy todir="${target.dir}">
<fileset dir="${chapters.dir}"
excludes="~*" />
Then the builder script:
  TARGET_DIR = 'target'
CHAPTERS_DIR = 'chapters'
  ant = new AntBuilder()
fileset(dir:CHAPTERS_DIR, includes:'*.doc', excludes:'~*')

Well, that's at least readable (and writeable), and it states exactly what you want! When I see such a comparison I think that the XML code should be used only as Ant-internal, assembly-code level data representation! You could maybe find it interesting that even the creator of Ant himself admitted that using XML was "probably an error"*.

The second problem was that, (as another programmer put it):

Developers from other languages often find Java's over-reliance on XML configuration annoying. We use so much configuration outside of the language because configuration in Java is painful and tedious. We do configuration in XML rather than properties because...well, because overuse of XML in Java is a fad.**

I can only confirm this. You can get a feel of how much of XML "code" was needed in a Java web application I was then writing from the following (addmittedly pretty old) table*** :

     Metric               Java +              Ruby +
Spring/Hibernate Rails

Time to market 4 months, approx. 4 nights,
20 hours/week 5 hours/night
Lines of code 3,293 1,164
Lines of config 1,161 113
Number of classes/ 62/549 55/126

Do you see? The configuration, it's 30% of the code (!!!), and that's 10 times more than was needed in Rails! That's some dependency! By the way, note that Rails needs only 3 times less code that Java, despite the fact that Java libraries can be sometimes pretty low level!

Now, for me the Spring 2 platform was somehow an apogeum of the XML overdependency - the endless configuration files weren't even human readable, you should better use a graphical Eclipse plugin:

to edit the wirings. That's like you need something like UML to tame the complexity of the XML files (and of Spring's configuration of course):

A rhetoric question: isn't it too much of a good thing for something as simple as configuration files? I found the Spring-2 MVC (Spring's web GUI application framework) particularly bad in that respect. Comparing it to the Struts 2 configuration, I had an impression that every obvious fact must be expressed in XML, and that there's no defaults altogether:

<bean name="userListController"
<property name="userListManager">
<ref bean="userListManager" />
<bean id="userListManager"
class="de.ib-krajewski.myApp.UserListManager" />

and so on, and on, and on.

But wait, it comes even better: some people are so enamoured with XML that they are using it as a programming language****, either doing patterns with Spring and XML configuration files, or even inventing an XML-based programming language and writing its interpterer in XSLT (I'm not joking!).

A word of caution: Java community noticed the problem and is trying to reduce the amount of XML needed. Unfortunately, I haven't got any hard numbers, but each new release of a Java framework seems to claim a "reduced XML config file size" - take for example Spring MVC v.3. However, as the DLL hell is still a reality despite of Microsoft's efforts and the introduction of manifests (really, look here!), equally, for my mind the overdependence on XML won't vanish overnight from the Java world - it's at its core!

2. The Analysis

2.1 XML data

Why are we all using XML? The received standard answer is: because it's a portable, human-readable, standardized data format with a very good tool support!

So why does it all feel so wrong? The first reason I see is that XML is too low level for human consumption. It's claiming to be human-readable while it's only human-tolerable - see the Groovy AntBuilder example above. There are tons and tons of verbiage (although that's something a Java programmer will be rather accustomed to ;-)) which is what machines need, but it's not how human mind is working. It needs high level descriptions because it's only all to easy distracted by unneeded details.

When I recall my own config file implementations of yore, I conclude that I never needed more constructs than single config values ans lists of them. And I never needed hierarchical config data for my systems. So if people are using XML for configuration, they are maybe after something more than pure config settings, maybe they are looking for a scripting solution for the application (as mentioned in*) along the lines of the late Tcl? But with Tcl the model was entirely different: the Tcl scipts glueing together passive modules of code, whereas now, the modules are active and use XML as a database of sorts... So is it that at last? People using XML data as a poor man's, read only database? It's a good, ad-hoc solution, and you can emulate hierarchical, network and OO databases without much of a hassle, can't you? So people are using XML as a kind of qiuck and dirty hack, and hacks aren't normally very pretty ;-).

2.2 XML in Java

The second part of the question is: why are we using so much XML in Java? The answer is probably that we need XML config files to increase modularity an require the loose coupling of our systems. The Spring framework allows us to pull the parts of the system together at the startup or run-time, and that's supposedly a good thing, isn't it.

Coming to this part of the question: I think that the loose coupling thing is overrated. As Nicolai Yossutis said in his book about SOA (we reported ;-)):

"...loose coupling has its price, and it's the complexity."

So firstly, a little bit of coupling isn't that bad if it decreases complexity and increases readability of code! To my mind, it's a classical tradeoff, but as some developers are only too prone to introduce tight coupling all over the place, the received wisdom is to avoid the coupling at every price! But be honest, you can't decouple everything, if the parts are ment to be working together! Secondly, I thing that the "inversion of control" pattern is a bit of overkill too.

Personally, I'd like to glue together the parts of the system directly in code, so that I can see how my software is composed, and that in my source code - it should be the sole point of reference! Thus I like much more the approach Guice or Seam are taking (and nowadays the Spring framework itself) - that of annotations. At last we have the relevant information where it belongs - in code! The problem here is, that the annotations aren't really code, they are metadata, and I don't like the idea of using metadata where you could do things directly by using some API. But hey, that's probably the way we are doing such things in Java...

3. A Summary

That's what I wanted to write approximately 2 years ago. Meanwhile I use XML everywhere in my architectures and designs, mainly because it excludes any discussions about formats! No one has ever said anything against XML!

But programming it (in C++ and Qt) is a real pain - I first tried DOM, than SAX - and it was an even greater mess (well, there's a C++ data binding implementation á la Java's Apache XMLBeans but it's not used at my client's :-((). XPath's selectors are supposed to be better, but Qt doesn't have it by now. The good thing is, everyone will think the system is sooo cool because it uses XML, and I have to make my users happy!

PS: And you know what? The new dm application server by SpringSource uses JSON for configuration files!!! I told you so ;-)

* "The creator of Ant excercising hit own daemons", this article was originally stored under, but this link seems to be dead, so I'll host it on my website for a while (hope it's OK...):

** as a matter of fact, he's a Ruby programmer too, so he isn't that unpartial:

***taken from the book "Beyond Java" by Bruce Tate:, unfortunately the original link stated there isn't working: Justin Gehtland, Weblogs for Relevance, LLC (April 2005); "I *heart* rails; Some Numbers at Last."

**** "Wiring The Observer Pattern with Spring":, and "XSL Transformations. A delivery medium for executable content over the Internet" in DDJ from April 05, 2007: . I cite from the latter:

XIM is an XML-based programming language with imperative control features, such as assignment and loops, and an interpreter written in XSLT. XIM programs can thus be packaged with an XIM processor (as an XSLT stylesheet) and sent to the client for execution.
See, I wasn't joking...

Saturday, 4 April 2009

Some distributed computing news: cuddly elephans, pigs, nymphs and couch computing

As I did some distributed computing research in my previous life, and then tried to get involved with Google, I'm still somehow interested in the subject. If you are into distributed computing, you'd know that Google is using a distributed C++ system called MapReduce* for indexing of the crawled webpages. It happens that there's an open-source Java implemenation of Google's system (based on published Google's research papers) called Hadoop . Up to now I didn't know about any serious usage of Hadoop in big distributed systems, but as it comes, I learned that Yahoo! is using Hadoop internally** for spam detection!

Yahoo's researchers have also designed a new language called PIG Latin (?!)*** for distributed programming on the Hadoop platform. The idea is rather an interesting one: they pointed out that the existing way of writing programs for the mapper and reducer part of a distributed application is too low-level and not reusable, but somehow familiar to programmers. In contrast, a declarative language like SQL poses some problems, so the paradigm there is completely diffrent:

At the same time, the transformations carried out in each step are fairly high-level, e.g.,filltering, grouping, and aggregation, much like in SQL. The use of such high-level primitives renders low-level manipulations (as required in map-reduce) unnecessary.

To experienced system programmers, this method is much more appealing than encoding their task as an SQL query, and then coercing the system to choose the desired plan through optimizer hints.
So it provides a middle ground, where we can avoid SQL and write down the computation steps, but can do that at higher level!

Unsuprisingly, Microsoft already has a proprietary distributed computing platform called Dryad, which tries to do something like this as well, but is using C# and LINQ (i.e. an SQL-like query language):

DryadLINQ generates Dryad computations from the LINQ Language-Integrated Query extensions to C#.

But I don't know how much this is a research project and if it's really used internally. On the other side, Microsoft recently aquired a search company called Powerset, which is using Hadoop at its core, and as the gossip runs, should be used to bring Microsoft's LiveSearch up to speed.

Facebook, in their turn, are using Hadoop, but added a Business Intelligence tool called Hive (now released as open source!) on top of it. It uses an SQL-like query language, as BI-users understand it best. Beneath the surface Hive leverages Hadoop and translates SQL-like imperatives into MapReduce jobs.

Not bad, cuddly elephant! It seems everyone is loving you!

Accidentally, when speaking about SQL and non-SQL: there's a non-SQL database by Apache called Couch DB. It is written in Erlang (we reported already**** ;-)) and implements a distributed document database. It's designed for lock-free concurrency using Erlang's "share-nothing" philosophy. To my mind it resembles Linus' Thorvalds git source configuration system, as it holds all the versions of documents determining the "winner" to be visible in defined views. Hear, hear! Erlang's alive, distributing and kicking!

Another thing worth noticing is that's not only SQL anymore, there are attempts to do things differently, athough others are sticking with the ol 'n reliable. Paradigm change on its way? It's interesting times I think.

PS (6 Okt. 2009): Look, look, now even the Big Blue has jumped the wagon too! They offer the M2 enterprise data analysis platform, based on Hadoop and using PIG. See: So it seems Hadoop is definitely mainstream and business analyst thingie now, and as such it has lost much of its appeal, as far as I'm concerned.

* Jeffrey Dean, Sanjay Ghemawat: "MapReduce: Simplified Data Processing on Large Clusters", OSDI 2004,
** unfortunately it's a two-parts video:
*** Christopher Olston et al: "Pig Latin: A Not-So-Foreign Language for Data Processing", SIGMOD 2008,