Recently the Hadoop project posted a HadoopVsGridgain comparison page on its wiki. I have always been a big fan of Hadoop. Although I believe that the product is very hard to use and its APIs are far from obvious, I still think they have achieved quite a lot, and the fact that Yahoo Search runs on Hadoop proves that the system works and scales quite well. However, this ridiculous "comparison" threw me off a bit, and the only reason I can think of for putting it up is that GridGain has started cutting significantly into their user base.
Generally, such comparisons from vendors look plain silly. Nobody expects a vendor to be fair when talking about a perceived competitor and, needless to say, many points made on that page are wrong. Moreover, the main differences between the two products are not even touched!
Hadoop comes with the Hadoop Distributed File System (HDFS), which is its main feature. It also comes with a MapReduce component that lets you process files stored on HDFS in parallel. HDFS is extremely performant and allows for scalable and fast processing of the data stored on it. It is great for applications that can afford to put all of their data into HDFS (Yahoo Search), but it is not at all suited to the vast majority of applications that use conventional databases such as Oracle or MySQL.
GridGain, on the other hand, is a MapReduce computational grid platform, the main feature of which is to split a task into smaller jobs and execute them in parallel on the grid. It handles node discovery and communication, peer class loading, scheduling and job collision resolution, load balancing, data affinity, transparent grid-enabling via AOP, and many other computational features out of the box. GridGain does not come with any file system of its own, but integrates with all major data grid products to provide collocation of data and computations - this is how GridGain is able to process terabytes of data stored in any database or file system.
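To make the "split a task into smaller jobs and execute them in parallel" idea concrete, here is a minimal sketch in plain Java. It deliberately does not use the GridGain API: a local thread pool stands in for the grid nodes, and the class and method names (SplitAndAggregate, split, reduce, countNonSpaceChars) are made up for illustration. The task is to count the non-space characters in a phrase; it is split into one job per word, and the job results are summed at the end.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Illustration only: a local thread pool plays the role of the grid.
public class SplitAndAggregate {

    // Split step: one job per word; each job returns the length of its word.
    public static List<Callable<Integer>> split(String phrase) {
        List<Callable<Integer>> jobs = new ArrayList<>();
        for (String word : phrase.trim().split("\\s+")) {
            jobs.add(() -> word.length());
        }
        return jobs;
    }

    // Reduce step: wait for every job and sum the partial results.
    public static int reduce(List<Future<Integer>> results) throws Exception {
        int total = 0;
        for (Future<Integer> result : results) {
            total += result.get();
        }
        return total;
    }

    // The whole task: split, run the jobs in parallel, then reduce.
    public static int countNonSpaceChars(String phrase) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            return reduce(pool.invokeAll(split(phrase)));
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        // 5 + 8 + 5 = 18
        System.out.println(countNonSpaceChars("Hello GridGain world"));
    }
}
```

On a real grid the jobs would be shipped to remote nodes rather than to local threads, which is where discovery, load balancing and the other features listed above come in; the sketch only shows the split and aggregate shape of the computation.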
So, the main difference between GridGain and Hadoop is that Hadoop forces you to migrate all of your data into its own Hadoop Distributed File System (HDFS), whereas GridGain does not and instead lets you work with your existing databases.
I also want to add that GridGain is far simpler to use, and the APIs it provides are more natural and easier to understand, but take this with a grain of salt as I am definitely biased here :)



3 comments:
I'm sorry that you feel that my comparison is "ridiculous." If you have any points that are factually wrong, please let me know and I'll fix them. I did pull out the comment about Hadoop being free while GridGain costs $20,000/year on 6 CPUs, because it was pointed out that this was the price of GridGain's support. While Hadoop's community support is free, commercial support for either framework does cost money.
Hadoop absolutely does not require that your data be put into HDFS. There are bindings to read data from files, which includes NFS, Amazon's S3, or KFS. There are also input and output formats that read and write from/to HBase servers. It is completely pluggable using user code.
And no, I'm not worried about GridGain stealing our customer-base. (As an open source project being implemented by a diverse community of contributors, we don't think of it as a customer-base, because that implies you are selling them something. We have users...) For most of the applications that use Hadoop, GridGain is not at all appropriate.
An extremely biased perspective of the relative impact is given by looking at mentions in markmail's archives of open source email lists: GridGain and Hadoop
Hi Owen,
Yes, you did clean up the initial version of the comparison, which is a step forward.
You mention that you work with other file systems as well, but how about normal Oracle database users? What if the data is stored in a normal database?
I agree that GridGain is not appropriate for most applications that use Hadoop. The same works the other way around as well. But in that case, why even put up such a silly comparison that has nothing to do with the real differences? (We, btw, never got a single request about Hadoop from any of our users or clients.)
And finally, about the search links you put up. The "impact", or the number of Hadoop search hits, comes almost exclusively from every individual comment on every Hadoop Jira ticket. Plus you guys got every source class indexed on Google for all mirror download sites :) Remove that, and it's not that many.
Actually, I looked around the Hadoop Jira, and someone did do a Hadoop input format for MySQL, in Hadoop-2356. I should work with the author to finish up the patch...