Tuesday, September 18, 2012

The Big Data Evolution Will Continue - No Kidding

Big Data is very much about discovering the information locked in the mountains of data that come out of your production center, IT operations, enterprise systems, and back-office databases. Information is in the eye of the beholder, so one person's junk is another person's gold. These days, with the volumes of social data and device data growing at astronomical rates, there is a lot of data to sift through and make sense of.

While it is true that the more data you capture, the more information you can potentially discover, there is a limit to this. I think we are going through a cycle where capturing and trying to make sense of vast volumes of data (social data, sensor data, etc.) is becoming more economical and somewhat mainstream with respect to technology and tools. However, I believe this is cyclical: at some point businesses will realize that they may be getting diminishing returns on all the data they are capturing and storing. For example, 20 years from now, will I really care what I tweeted 20 years earlier? I will probably never have the time to go back and look at it, and I am not sure it would be valuable to any marketing person (but who knows).

There is definitely gold to be mined in many data sets that now go untapped, and technologies like Hadoop, BigQuery, and Storm, to name a few, are good tools to use. But not everything fits into the Big Data tent.

There has been a lot of hype around Big Data lately, and I see a lot of people trying to shoehorn problems into Hadoop for no reason other than it being the cool thing to do. Often the data crunching could be done in easier ways. However, the tool sets are expanding to give developers, scientists, and business people more options when deciding how to store and analyze their data.

When thinking about Big Data, first ask yourself the following questions:

1) How much data do I want to capture and store? (Do I need to persist detailed records/data?)
2) How fast is this data being created (velocity)?
3) How long do I want to keep it (forever?)
4) How long am I willing to wait to get "information" when I run my analysis (batch/hourly/daily or real-time)?
5) What will it cost me to keep all this data around, and do I have the system admin muscle to do this?

Answering these questions might help you determine which of the emerging Big Data technology buckets your problem best fits into and which approach to take (cloud cluster, on-premises cluster, etc.).
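
As a rough illustration, the five questions above can be turned into a back-of-envelope calculation. The sketch below is only that: the Workload fields, the one-hour batch threshold, the storage price, and the example numbers are all assumptions made up for illustration, and it ignores replication, compression, and staffing costs.

```python
from dataclasses import dataclass

SECONDS_PER_DAY = 86_400


@dataclass
class Workload:
    """Hypothetical answers to questions 1-4 for one data source."""
    bytes_per_record: int        # question 1: how much detail is persisted
    records_per_second: float    # question 2: velocity
    retention_days: int          # question 3: how long the data is kept
    max_wait_seconds: float      # question 4: how long you can wait for answers


def stored_bytes(w: Workload) -> float:
    """Raw volume held once the retention window is full."""
    return (w.records_per_second * w.bytes_per_record
            * SECONDS_PER_DAY * w.retention_days)


def monthly_storage_cost(w: Workload, dollars_per_gb_month: float) -> float:
    """Question 5, storage only. The price is an input, not a quoted rate."""
    return stored_bytes(w) / 1e9 * dollars_per_gb_month


def processing_style(w: Workload, batch_threshold_seconds: float = 3600) -> str:
    """Crude split: if you can wait an hour or more (assumed cutoff), batch may do."""
    if w.max_wait_seconds >= batch_threshold_seconds:
        return "batch"
    return "real-time"


# Made-up example: 1,000 records/s of 500 bytes each, kept for a year,
# with answers needed once a day.
example = Workload(bytes_per_record=500, records_per_second=1000,
                   retention_days=365, max_wait_seconds=SECONDS_PER_DAY)

print(f"stored: {stored_bytes(example) / 1e12:.2f} TB")
print(f"storage cost at an assumed $0.03/GB-month: "
      f"${monthly_storage_cost(example, 0.03):.2f}")
print(f"processing style: {processing_style(example)}")
```

Even a crude estimate like this makes the diminishing-returns question concrete: change retention_days from 365 to "forever" and see what the storage bill does.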
