Hadoop is the leading contender to enable organizations to economically and incrementally take advantage of distributed storage and scalable distributed processing to tackle the Big Data challenges ahead. The days of buying expensive vertically scaling servers and expensive storage systems are over. Hadoop started from the humble beginnings of MapReduce and distributed storage (HDFS), and it has now expanded to touch and integrate with all corners of the enterprise computing fabric, from real-time business intelligence to ETL and data warehousing. These days, almost any company with some kind of database or software analytics solution has put the word "Big" in its title and offers some level of Hadoop integration. There is nothing really bad about that, and it is great to see everyone gravitating to the Hadoop ecosystem as an open source standard of sorts for Big Data.
Hadoop has a lot of potential to solve problems that in the past required much more expensive and proprietary systems. Note that Hadoop is in many respects no less complex than past and existing proprietary Big Data platforms, and it is by no means free. Hadoop has its own complexity challenges: it has many distributed hardware moving parts and is more or less a loose collection of many open source projects. Hadoop has a lot of creative minds and companies driving its fast evolution, but it is not a plug-and-play solution out of the box, nor a one-size-fits-all solution by any stretch of the imagination. Hadoop does not come cheap by any measure, but with Hadoop you have more opportunity to grow your Big Data system as you go, with the potential for less vendor lock-in and more flexibility over what you pay for (note that I use the word potential here). The value you get out of Hadoop depends on your expectations, on your investment in people and training, and on the key decisions you make along the way.
So how does an organization begin down the road of figuring out how Hadoop fits into their existing ecosystem and how much and how fast to invest in Hadoop? Let's see if we can walk through some common questions, challenges and experiences one would go through as they begin their Hadoop quest.
First you need to understand what makes Hadoop tick.
It is important to understand that, out of the gate, Hadoop does not necessarily invent anything that has not existed before in other products. There are some novel concepts and cool innovations in Hadoop, but fundamentally it is about a few key concepts. It is founded on distributed computing and distributed storage using commodity hardware. Ultimately, Hadoop is about growing your data storage and processing in an incremental and economical way using largely open source technology and off-the-shelf hardware. Note that open source does not mean free, of course.
Okay, so what problem do we want to solve with Hadoop? Please don't say all of them.
One of the nice things about Hadoop is that organizations of any size can adopt it. You can be a small startup with a simple idea and run Hadoop on a small cluster on Amazon, or you can be a larger enterprise with massive clusters performing high-end processing, such as crawling and indexing the entire web. Hadoop can be used in a variety of situations: to reliably store large volumes of data on commodity storage, or for much more complex computing, ETL, NoSQL and analytical processing.
For larger organizations that are getting started with Big Data, it is vital to identify some key problems you want solved with Hadoop and that might fit and integrate well with existing legacy systems. Hadoop is particularly good at being a holding area for unstructured data like web or user logs that you might want to keep in raw format for later analysis and auditing, for example. What is typically important is to start small and solve some specific problems on specific data sets and then expand your application of Hadoop as you go. This includes getting accustomed to the many programming and DSL packages that can be used to process Hadoop data.
Hey, in a Big Data universe we never throw anything away.
Talk about Big Data often suggests that the typical application of Hadoop is to store everything forever. Obviously this is not practical. Many vendors that provide software and hardware for Hadoop would love for you to try, but the reality is that you still need to understand your data limits and have clear aging and time-to-live policies. Hadoop does let you scale your storage out to petabytes, potentially, but there is no free lunch here. A critical aspect of this is understanding the format you store your data in within Hadoop. Here again, you hear a lot of talk about storing all your data in "raw format" so you have all the details needed to extract deep information from your data in the future. While this sounds great in theory, it is not practical in most cases. In reality, you can keep some data in raw format, but you must typically transform your Hadoop data into other formats besides just raw files or simple HDFS sequence files, for example. Storing your data in HDFS often also means transforming it into semi-structured or column-oriented stores for use by tools such as Hive and HBase and other query engines, for better performance. So structure matters as you get into more complex analytics, and you should expect to have your data stored in Hadoop in possibly multiple formats, or at least transformed via Hadoop-based ETL into formats other than the "raw" acquisition format. This all adds up to more and more storage requirements, so make sure you understand the math to properly size your Hadoop storage needs.
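As a rough sketch of that sizing math, here is a back-of-the-envelope calculation in Python. The function name, the parameters and every number in the example are my own assumptions for illustration; the only value taken from Hadoop itself is the HDFS default replication factor of 3. Treat it as a starting point for your own spreadsheet, not a vendor-grade sizing tool.

```python
def required_raw_storage_tb(
    daily_ingest_tb,
    retention_days,
    replication_factor=3,      # HDFS default block replication
    derived_copies_ratio=0.5,  # assumed: extra space for transformed formats, as a fraction of raw data
    compression_ratio=1.0,     # assumed: 2.0 means data shrinks to half its size on disk
    headroom=0.25,             # assumed: fraction of disk kept free for temp data and growth
):
    """Estimate the raw disk capacity (in TB) needed across the whole cluster."""
    if compression_ratio <= 0:
        raise ValueError("compression_ratio must be positive")
    if not 0 <= headroom < 1:
        raise ValueError("headroom must be in the range [0, 1)")

    retained_tb = daily_ingest_tb * retention_days       # what the aging policy keeps
    logical_tb = retained_tb * (1 + derived_copies_ratio)  # raw data plus transformed copies
    on_disk_tb = logical_tb / compression_ratio          # after compression
    replicated_tb = on_disk_tb * replication_factor      # HDFS stores each block N times
    return replicated_tb / (1 - headroom)                # leave free space on the disks


# Hypothetical example: 0.5 TB/day kept for one year, 2x compression.
estimate = required_raw_storage_tb(0.5, 365, compression_ratio=2.0)
print(round(estimate, 1))  # 547.5
```

Notice how 182.5 TB of retained data turns into roughly 550 TB of raw disk once transformed copies, replication and free-space headroom are accounted for. Changing the retention period is usually the cheapest lever you have.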
Now this software is open source which means mostly free, right?
Obviously we have all learned by now that open source does not necessarily mean free. Red Hat, as an example, has a pretty good business around open source and is quite successful at making a profit. Hadoop vendors are no different. There are several well-funded start-ups that have Red Hat-like business models around Hadoop, not to mention all the big boys trying to retrofit their existing Big Data solutions to be Hadoop friendly. None of them are free, but they are all different from each other. It is important to understand each Hadoop vendor's strengths and weaknesses and where they are coming from. The vendor's history does matter for a lot of reasons that I will discuss in a later post.
Now, in theory you could go it alone and use Hadoop completely free: just download most of the Hadoop packages from Apache (and a few other places). For example, I have downloaded and installed versions of Hadoop from the Apache Foundation and have been able to run basic MapReduce and HDFS jobs on small clusters, all for free and without going through any Hadoop vendors. You can also use the community versions of the Hadoop distributions from the major Hadoop vendors. This can work, but you are on your own, and how feasible this approach is depends on who you are and how savvy your technical staff are. It is also important to understand how the various Hadoop distributions and players differ from each other and how much you are getting "locked in" with each Hadoop vendor. The retro-fitted Hadoop vendors (as I call them) have a lot more polish and savvy when they pitch Hadoop to you, while some of the Hadoop startup vendors have varying degrees of proprietary software embedded in their Hadoop distributions. It is critical to understand these facts, and it is important to consider how much you are willing to build on top of Hadoop yourself versus relying 100% on your Hadoop partner. These are important considerations that can sometimes get lost in internal management jockeying over who will be the Big Data boss. Vendor lock-in is very important to understand, along with clearly planning for sizing, capacity and long-term incremental growth of your cluster.
This all leads to understanding the cost of Hadoop as you set expectations over what problems you want your Hadoop cluster to solve from day one. Sizing your Hadoop cluster for storage, batch computing, real-time analytics/streaming, and data warehousing must be considered. How you capacity plan your storage, HDD spindles, and CPU cores is a critical decision as you plan the nuts and bolts of your Hadoop cluster. Your Hadoop partner/vendor can help you with this sizing and planning, but here again, each vendor will approach it differently depending on who they are and who you are (how deep your pockets are). You have to be smart here and know what is in your best interest long-term.
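To make the capacity planning a little more concrete, here is a minimal sketch that turns a storage requirement and a core requirement into a node count. The node specification (12 disks of 4 TB and 16 cores per node) and the workload figures are hypothetical numbers I picked for illustration, not a recommendation from any vendor.

```python
import math


def nodes_needed(raw_storage_tb, total_cores,
                 disks_per_node=12, tb_per_disk=4.0, cores_per_node=16):
    """Size a cluster by whichever resource (storage or CPU) runs out first.

    All defaults describe a hypothetical worker node; replace them with the
    hardware you are actually quoting.
    """
    if min(disks_per_node, tb_per_disk, cores_per_node) <= 0:
        raise ValueError("node specification values must be positive")

    storage_nodes = math.ceil(raw_storage_tb / (disks_per_node * tb_per_disk))
    compute_nodes = math.ceil(total_cores / cores_per_node)
    nodes = max(storage_nodes, compute_nodes)
    return {
        "storage_nodes": storage_nodes,   # nodes needed for disk capacity alone
        "compute_nodes": compute_nodes,   # nodes needed for CPU cores alone
        "nodes": nodes,                   # what you actually have to buy
        "total_spindles": nodes * disks_per_node,
    }


# Hypothetical workload: 547.5 TB of raw disk and 256 cores across all jobs.
plan = nodes_needed(547.5, 256)
print(plan["nodes"], plan["total_spindles"])  # 16 192
```

In this made-up case the cluster is CPU-bound: storage alone needs 12 nodes, but the core count pushes it to 16. Seeing which resource drives the node count tells you whether to ask your vendor for denser disks or more cores per box.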
Your Hadoop cluster is not an island.
It is vital to consider how your Hadoop cluster will fit in with your current IT environment and existing data warehousing and BI environments. Hadoop will typically not totally replace your existing ETL, data warehousing and BI systems. In many cases, it will live alongside existing BI systems. It is also vital to understand how you will be moving data efficiently into your Hadoop cluster and how much processing and storage is needed to put data into intermediate formats for optimal performance and efficient consumption by applications. These are critical questions to answer in order to get your Hadoop cluster running efficiently to effectively feed downstream systems.
You mean my Hadoop cluster does not run itself?
One underestimated area concerning Hadoop is planning for the operations and ongoing management of your Hadoop cluster. Hadoop is good technology, but it is fast evolving and has many moving parts, both at the infrastructure level (lots of nodes and HDDs) and from a software package perspective (lots of software packages that are fast evolving). This makes running, monitoring and upgrading/patching Hadoop a non-trivial task. For example, many of the Hadoop vendors offer both open source and proprietary solutions for managing and running your clusters. This obviously requires your operations and production IT staff to be included in the planning and management of your clusters.
Some other important questions and considerations as you get started with Hadoop:
- How will multi-tenancy and sharing work if more than one group is going to be using your cluster?
- Should I have one or a few big Hadoop clusters, or many small clusters?
- Understand your storage, processing, and concurrency needs. Not all Hadoop schedulers are created equal for all situations.
- Do you need or want to leverage virtualization and/or cloud bursting?
- Choose your hardware carefully to keep costs per TB low. How to manage TB vs CPU/core is important.
- Understand what you need in your edge nodes for utility and add-on software.
- Plan your data acquisition and export needs between your Hadoop cluster and the rest of your ecosystem.
- Understand your security needs at a data and functional level.
- What are your up time requirements? Plan for rolling patches and upgrades.
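For the hardware question in the list above, a quick way to compare candidate node configurations is to compute cost per TB and TB per core side by side. The sketch below is only an illustration: the node price and specification are invented numbers, and the usable figure simply divides raw capacity by the HDFS default replication factor of 3.

```python
def hardware_ratios(node_cost, disks, tb_per_disk, cores, replication_factor=3):
    """Return simple cost and balance ratios for one candidate worker node."""
    raw_tb = disks * tb_per_disk
    usable_tb = raw_tb / replication_factor  # capacity left after HDFS replication
    return {
        "cost_per_raw_tb": node_cost / raw_tb,
        "cost_per_usable_tb": node_cost / usable_tb,
        "raw_tb_per_core": raw_tb / cores,
    }


# Hypothetical node: $6,000, 12 x 4 TB disks, 16 cores.
ratios = hardware_ratios(6000, 12, 4.0, 16)
print(ratios["cost_per_raw_tb"])     # 125.0
print(ratios["cost_per_usable_tb"])  # 375.0
print(ratios["raw_tb_per_core"])     # 3.0
```

Running this for each quote you receive makes it easier to see whether a cheaper box is actually cheaper per usable TB, and whether it is storage-heavy or compute-heavy for your workload.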
Maybe I should have stated this in the beginning, but the reason I called this blog Protecting your Hadoop Investment is that many organizations enter into this undertaking without a clear understanding of:
- Why they are pursuing Big Data (other than it is the hot thing to do).
- How Hadoop differs from past proprietary Big Data solutions.
- How it can fit alongside existing legacy systems.
- How to ultimately manage costs and expectations at both a management and technical level.