Tuesday, February 7, 2012

Native Multi-Tenant Hadoop - Big Data 2.0

For Hadoop to gain wider adoption and lower the barrier of entry to a broader audience it must become much more economical for businesses of all sizes to manage and operate a Hadoop processing cluster. Right now it takes a significant upfront investment in hardware and IT knowhow to provision the hardware and the necessary IT admin skills to configure and manage a full blown Hadoop cluster for any significant operation.

Cloud services like Amazon Elastic Map Reduce help reduce some of this but they can quickly become costly if you need to do seriously heavy processing and especially if you need to manage data in HDFS as opposed to constantly moving it between your HDFS cluster and S3 in order to shutdown datanodes to save cost as is the standard with Amazon EMR. Utilities like Whirr also help push the infrastructure management onto the EC2 cloud but again here for serious data processing this can quickly become cost prohibitive.

Operating short lived Hadoop clusters can be q useful option, but many organizations need long running processing and need to leverage HDFS for longer-term persistence as opposed to just a transient storage engine during the lifespan of MapReduce processing as is the case of Amazon EMR. For Hadoop, and Big Data in general, to make the next evolutionary leap for the boarder business world, we need a fully secure and multi-tenant Hadoop platform. In such as multi-tenant environment organizations can share clusters securely and manage the processing load in very controllable ways. And also allow each tenant to customize their Hadoop job flows and code in an isolated manner.

Hadoop already has various capacity management scheduling algorithms but what is needed is higher order resources management that can full isolate between different organizations for HDFS security and data processing purposes to support true multi-tenant capability. This will drive wider adoption within large organizations and by infrastructure services providers because it will increase the efficient utilization of unused CPU and storage just in same way that SaaS has allowed software to achieve greater economies of scale and services and democratize software for small and big organizations alike.

Native multi-tenant support in Hadoop will drastically reduce the upfront cost of rolling out a Hadoop environment and make the long-term costs much more cost effective and open the door for Hadoop and Big Data solutions to go mainstream in much the same way that Salesforce, for example, has created a rich ecosystem of solutions around business applications and CRM. This will also allow organizations to keep long-running environments and keep their data in HDFS for longer periods of time allowing them be more creative and spontaneous.

No comments: