Saturday 4 April 2009

Some distributed computing news: cuddly elephants, pigs, nymphs and couch computing

As I did some distributed computing research in my previous life, and then tried to get involved with Google, I'm still somewhat interested in the subject. If you are into distributed computing, you'll know that Google uses a distributed C++ system called MapReduce* for indexing crawled webpages. It so happens that there's an open-source Java implementation of Google's system (based on Google's published research papers) called Hadoop. Up to now I didn't know of any serious usage of Hadoop in big distributed systems, but as it turns out, Yahoo! is using Hadoop internally** for spam detection!
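For readers who haven't seen the model, here is a minimal single-process sketch of the map/shuffle/reduce idea in Python. It is only an illustration: the function names (map_phase, shuffle, reduce_phase, word_count) and the sample pages are made up for this post, and neither Google's C++ system nor Hadoop's Java API looks like this.

```python
from collections import defaultdict


def map_phase(doc_id, text):
    """Mapper: emit a (word, 1) pair for every word in one document."""
    for word in text.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group all emitted values by key, as a framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(word, counts):
    """Reducer: combine all values seen for one key."""
    return word, sum(counts)


def word_count(documents):
    """Run the three phases in sequence over a dict of doc_id -> text."""
    pairs = [pair
             for doc_id, text in documents.items()
             for pair in map_phase(doc_id, text)]
    return dict(reduce_phase(key, values)
                for key, values in shuffle(pairs).items())


# Hypothetical sample input, invented for this illustration.
pages = {"page1": "the elephant and the pig", "page2": "the nymph"}
counts = word_count(pages)
print(counts)
```

In a real cluster the mappers and reducers run on many machines and the shuffle moves data over the network; here everything happens in one process, just to show the shape of the computation.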

Yahoo's researchers have also designed a new language called Pig Latin (?!)*** for distributed programming on the Hadoop platform. The idea is a rather interesting one: they point out that the existing way of writing programs, as the mapper and reducer parts of a distributed application, is too low-level and not reusable, though its step-by-step style is familiar to programmers. A declarative language like SQL, in contrast, poses problems of its own, as the paradigm there is completely different:

At the same time, the transformations carried out in each step are fairly high-level, e.g., filtering, grouping, and aggregation, much like in SQL. The use of such high-level primitives renders low-level manipulations (as required in map-reduce) unnecessary.
...

To experienced system programmers, this method is much more appealing than encoding their task as an SQL query, and then coercing the system to choose the desired plan through optimizer hints.
So it provides a middle ground, where we can avoid SQL and write down the computation steps, but can do that at a higher level!
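To make the "steps at a higher level" idea concrete, here is a rough sketch in plain Python (not actual Pig Latin syntax). The helpers filter_step, group_step and aggregate_step and the sample rows are invented for this illustration; the point is only that each line names one high-level transformation, and the order of the steps is written down explicitly.

```python
from collections import defaultdict

# Hypothetical sample rows: (url, category, pagerank)
urls = [
    ("a.example", "news", 0.9),
    ("b.example", "news", 0.1),
    ("c.example", "blogs", 0.7),
    ("d.example", "blogs", 0.5),
    ("e.example", "shops", 0.3),
]


def filter_step(rows, predicate):
    """Keep only the rows for which the predicate holds."""
    return [row for row in rows if predicate(row)]


def group_step(rows, key):
    """Collect rows into a dict keyed by key(row)."""
    groups = defaultdict(list)
    for row in rows:
        groups[key(row)].append(row)
    return dict(groups)


def aggregate_step(groups, fn):
    """Reduce every group of rows to a single value."""
    return {name: fn(rows) for name, rows in groups.items()}


# The program is a sequence of named steps, one transformation each.
good_urls = filter_step(urls, lambda row: row[2] > 0.2)
by_category = group_step(good_urls, lambda row: row[1])
avg_rank = aggregate_step(
    by_category, lambda rows: sum(row[2] for row in rows) / len(rows))
print(avg_rank)
```

In SQL the same thing would be one declarative query, with the execution order left to the optimizer; here the programmer decides it, yet never has to write a mapper or a reducer by hand.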

Unsurprisingly, Microsoft already has a proprietary distributed computing platform called Dryad, which tries to do something similar, but uses C# and LINQ (i.e. an SQL-like query language):

DryadLINQ generates Dryad computations from the LINQ Language-Integrated Query extensions to C#.

But I don't know to what extent this is a research project, or whether it's really used internally. On the other hand, Microsoft recently acquired a search company called Powerset, which uses Hadoop at its core and, as the gossip goes, is supposed to bring Microsoft's LiveSearch up to speed.

Facebook, for its part, is using Hadoop too, but has added a Business Intelligence tool called Hive (now released as open source!) on top of it. It uses an SQL-like query language, as that is what BI users understand best. Beneath the surface, Hive leverages Hadoop and translates the SQL-like statements into MapReduce jobs.
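How might a SQL-like statement end up as a MapReduce job? The following toy Python sketch shows the general idea for one query shape, SELECT column, COUNT(*) ... GROUP BY column. It is an assumption-laden illustration, not Hive's actual compiler: compile_count_by, run_job and the page_views rows are all hypothetical names made up here.

```python
from collections import defaultdict


def compile_count_by(column):
    """Build a mapper/reducer pair for: SELECT column, COUNT(*) GROUP BY column."""
    def mapper(row):
        # The GROUP BY column becomes the key, each row counts as 1.
        yield row[column], 1

    def reducer(key, values):
        # COUNT(*) becomes a sum over the ones emitted for this key.
        return key, sum(values)

    return mapper, reducer


def run_job(rows, mapper, reducer):
    """Single-process stand-in for a job: map, group by key, reduce."""
    groups = defaultdict(list)
    for row in rows:
        for key, value in mapper(row):
            groups[key].append(value)
    return dict(reducer(key, values) for key, values in groups.items())


# Hypothetical sample table, invented for this illustration.
page_views = [
    {"user": "ann", "page": "/home"},
    {"user": "bob", "page": "/home"},
    {"user": "ann", "page": "/about"},
]

mapper, reducer = compile_count_by("page")
result = run_job(page_views, mapper, reducer)
print(result)
```

The user only ever states what is wanted (count by page); the mapper and reducer are generated from that description.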


Not bad, cuddly elephant! It seems everyone is loving you!


Incidentally, speaking of SQL and non-SQL: there's a non-SQL database by Apache called CouchDB. It is written in Erlang (we reported already**** ;-)) and implements a distributed document database. It's designed for lock-free concurrency using Erlang's "share-nothing" philosophy. To my mind it resembles Linus Torvalds' git version control system, as it keeps all the versions of a document and determines a "winner" to be visible in the defined views. Hear, hear! Erlang's alive, distributing and kicking!
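The "keep every version, pick a winner" idea can be sketched in a few lines of Python. Note the assumptions: ToyDocumentStore is a made-up class, and the winner rule used here (highest generation number, ties broken by replica name) is a simplification chosen for illustration, not CouchDB's actual algorithm or API.

```python
class ToyDocumentStore:
    """Keeps every revision of a document and exposes one deterministic winner."""

    def __init__(self):
        # doc_id -> list of (generation, replica, body) tuples
        self._revisions = {}

    def put(self, doc_id, generation, replica, body):
        """Record a revision; nothing is overwritten, so no locking is needed."""
        self._revisions.setdefault(doc_id, []).append(
            (generation, replica, body))

    def revisions(self, doc_id):
        """Return all stored revisions, including the losing ones."""
        return list(self._revisions[doc_id])

    def winner(self, doc_id):
        """Pick the visible revision with a rule every replica can repeat.

        Assumed rule for this sketch: highest generation wins,
        ties are broken by the replica name.
        """
        generation, replica, body = max(
            self._revisions[doc_id], key=lambda rev: (rev[0], rev[1]))
        return body


store = ToyDocumentStore()
store.put("post-1", 1, "replica-a", {"title": "draft"})
# Two replicas edit the same document independently: a conflict.
store.put("post-1", 2, "replica-a", {"title": "edited on a"})
store.put("post-1", 2, "replica-b", {"title": "edited on b"})

print(store.winner("post-1"))
print(len(store.revisions("post-1")))
```

Because the rule depends only on the stored revisions, every replica that has seen the same revisions shows the same winner without any coordination, while the losing versions stay around for later inspection.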

Another thing worth noticing is that it's not only SQL anymore: there are attempts to do things differently, although others are sticking with the ol' 'n' reliable. Paradigm change on its way? These are interesting times, I think.

PS (6 Oct. 2009): Look, look, now even Big Blue has jumped on the bandwagon too! They offer the M2 enterprise data analysis platform, based on Hadoop and using Pig. See: http://www.sdtimes.com/link/33808. So it seems Hadoop is definitely a mainstream, business-analyst thingie now, and as such it has lost much of its appeal, as far as I'm concerned.

---
* Jeffrey Dean, Sanjay Ghemawat: "MapReduce: Simplified Data Processing on Large Clusters", OSDI 2004, http://labs.google.com/papers/mapreduce.html
** unfortunately it's a two-part video: http://developer.yahoo.net/blogs/hadoop/2009/03/using_hadoop_to_fight_spam_-_part_1.html
*** Christopher Olston et al: "Pig Latin: A Not-So-Foreign Language for Data Processing", SIGMOD 2008, http://www.cs.cmu.edu/~olston/publications/sigmod08.pdf
**** http://ib-krajewski.blogspot.com/2007/08/erlangs-change-of-fortunes.html
