Wednesday, February 5, 2014

Machine Learning: The Brains Behind Big Data

The first round of the data revolution has focused on commoditizing computing and storage. Platforms such as Hadoop and NoSQL have helped to propel this and have enabled businesses to economically deploy more powerful scale-out infrastructure than before. They have also changed and improved the way data warehousing and business intelligence are approached and managed. The storage and performance capabilities of Big Data have been a game changer. Traditional descriptive BI and reporting will never be the same. But this is just step one. The best is yet to come.

The industry is now going through a learning process on how to manage all this data at massive scale. Storing and managing more data is great, but people and businesses will get smarter about how much data to keep as it starts to hurt the pocketbook. How much data you keep and mine will depend on statistically driven best practices, not just on data warehousing capacity or how big your HDFS cluster is. The mainstreaming of Big Data has provided the muscle to store and process massive amounts of data at near-linear scale, but we will not see the real value of all this Big Data storage and processing until machine learning and data science tools become more accessible (to the non-PhD data scientists among us) and mainstream, and businesses learn how to apply these tools and disciplines effectively.

Machine Learning will provide the brains to go along with the Big Data muscle. In the long run, businesses will decide how much data to keep around based on statistical measures and best practices as they grow to understand their data and their business better and build out their predictive and prescriptive analytics.

Sunday, October 20, 2013

JobServer and Mesos Make a Great Pair

We are happy to announce the release of JobServer 3.6 beta1 with support for Mesos clustering and distributed job processing. Release 3.6 is an early access release of JobServer with integrated support for Mesos. With this release of JobServer, you can now schedule and run jobs on a Mesos cluster of any size and configuration. Say goodbye to cron jobs!

JobServer has always had support for distributed job scheduling and processing and has always been a great replacement for cron. Now, with Mesos integration, JobServer takes this to the next level by incorporating support for dynamic resource management and reliability, leveraging all the advantages of Mesos. JobServer also brings powerful scheduling, reporting and monitoring features to Mesos environments. Distributed job scheduling and batch processing just got more interesting!

With this release you can track and manage jobs as they run across a dynamic and highly resilient cluster of servers. JobServer with Mesos allows you to run scripts and jobs across your cluster of servers and control how resources are allocated and utilized. If you are a Mesos user today, give JobServer a try and say goodbye to cron. If you are a JobServer user, get your compute resources under control with Mesos.

Download the beta release of JobServer v3.6 and tame your IT environment using all the advantages of Mesos and JobServer.

Grand Logic delivers software solutions that automate your business processes and tame your IT operations & Big Data analytics. Grand Logic delivers data and job automation software and Hadoop consulting services that maximize your Big Data investment.

Friday, October 18, 2013

Mesos: The Datacenter is the Computer

The data center is the computer. The pendulum is swinging. Traditional cloud and virtualization-level resource management in the data center is no longer good enough to efficiently manage the growing demand for computing services in the enterprise. The answer to this challenge, offering more compute and storage services more efficiently, lies in solutions such as Mesos and YARN. These emerging cluster management platforms are the next evolution for fine-grained and efficient resource management of your data center infrastructure and services. As the need for more processing and storage grows, solutions like YARN and Mesos take center stage.

Big Data, mobile, and cloud computing have driven a tremendous amount of growth and innovation, but the byproduct has been more and more computing infrastructure needed to service the growth and manage the explosion of data. This has especially been the case as we have moved to using more clustered commodity hardware and distributed storage. You now have start-ups and smaller companies managing complex multi-node computing infrastructure for things like Hadoop, real-time event streaming, and social graphs, as well as for managing established core services like data warehousing, ETL and batch processing. All this has made it much more demanding to effectively manage and administer a dynamic hardware computing environment, and in many cases it has created isolated silos of resources dedicated to different tasks. For example, your Hadoop cluster is separate from your application services, database servers and legacy batch processing. This does not scale and it is not cost effective.

These silos have created inefficiencies within the data center and the enterprise environment. For example, if your Hadoop cluster of 10 nodes is running at maximum capacity only 70% of the time, what are those 10 nodes doing the other 30% of the time? The same can be said for the other services running in the data center. Unless you can treat your entire data center as one shared cluster of resources, you will have inefficiencies, and as the number of nodes and services you are managing grows, these inefficiencies will only increase. This is where solutions like Mesos can step in and give your applications and services one holistic view of your computing infrastructure. By using Mesos, you can reduce costs and more efficiently utilize the hardware and storage resources you already have, and you can grow more incrementally as more resources are needed.
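To put rough numbers on the idle-capacity question, here is a back-of-the-envelope sketch in Python. Only the 10-node, 70% Hadoop figure comes from the example above; the other silos, their sizes and their utilization figures are made up for illustration, and the pooled estimate ignores overlapping peaks, so treat it as a lower bound rather than a sizing recommendation.

```python
# Hypothetical silos: (name, node count, fraction of time busy at full capacity).
# Only the 10-node / 70% Hadoop row comes from the post; the rest are assumed.
SILOS = [
    ("hadoop", 10, 0.70),
    ("app_services", 6, 0.50),
    ("batch", 4, 0.25),
]


def idle_node_hours(nodes, busy_fraction, hours=24.0):
    """Node-hours per day that a dedicated silo sits idle."""
    return nodes * (1.0 - busy_fraction) * hours


def average_busy_nodes(silos):
    """Average number of busy nodes if every silo drew from one shared pool.

    This ignores peaks that overlap in time, so it is a lower bound on
    the size of a shared cluster, not a recommendation.
    """
    return sum(nodes * busy for _, nodes, busy in silos)


total_nodes = sum(nodes for _, nodes, _ in SILOS)
total_idle = sum(idle_node_hours(nodes, busy) for _, nodes, busy in SILOS)
avg_busy = average_busy_nodes(SILOS)

for name, nodes, busy in SILOS:
    print(f"{name}: {idle_node_hours(nodes, busy):.0f} idle node-hours per day")
print(f"dedicated nodes: {total_nodes}, average busy nodes: {avg_busy:.1f}")
print(f"total idle node-hours per day: {total_idle:.0f}")
```

The gap between the number of dedicated nodes and the average number of busy nodes is the capacity a shared resource manager could, in principle, hand to other workloads.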

Companies like Google, Twitter and Facebook are leading the charge to advance the state of the art for efficient data center and enterprise computing. Mesos is a great platform to leverage to reduce costs and improve the reliability and overall efficiency of your operational IT environment. Give Mesos a look. Cheers!

Thursday, August 15, 2013

Protecting your Hadoop Investment

The hype and buzz around Big Data in the tech industry is at astronomical levels. There are many factors driving this (both technical and human), but I won't get into that here. The fact is that Big Data (define it as you like) is here to stay, and many organizations need to find their path to the wonderland of bottomless data storage and boundless analytical computing, where no byte of data is ever thrown away and any question can be asked and answered about your data. Well, at least if you are Facebook or Google.

Hadoop is the leading contender to enable organizations to economically and incrementally take advantage of distributed storage and scalable distributed processing to tackle the Big Data challenges ahead. The days of buying expensive vertically scaling servers and expensive storage systems are over. Hadoop started from the humble beginnings of Map Reduce and distributed storage (HDFS) and has now expanded to touch and integrate with all corners of the enterprise computing fabric, from real-time business intelligence to ETL and data warehousing. These days, almost any company with some kind of database or software analytics solution has put the word "Big" in its title and offers some level of Hadoop integration. There is nothing really bad about that, and it is great to see everyone gravitating to the Hadoop ecosystem as an open source standard of sorts for Big Data.

Hadoop presents a lot of potential to solve problems that in the past required much more expensive and proprietary systems. Note that Hadoop in many respects is no less complex (and is by no means free) compared with past and existing proprietary Big Data platforms: it has its own complexity challenges, such as many distributed hardware moving parts, and it is more or less a loose collection of many open source projects. Hadoop has a lot of creative minds and companies driving its fast evolution. But it is not a plug and play solution out of the box, nor a one-size-fits-all solution by any stretch of the imagination. Hadoop does not come cheap by any measure, but with Hadoop you have more opportunity to grow your Big Data system as you go, with the potential for less vendor lock-in and more flexibility over what you pay for (note, I use the word potential here). The value you get out of Hadoop depends on your expectations and on your investment in people and training, along with key decisions you make along the way.

So how does an organization begin down the road of figuring out how Hadoop fits into their existing ecosystem and how much and how fast to invest in Hadoop? Let's see if we can walk through some common questions, challenges and experiences one would go through as they begin their Hadoop quest.

First you need to understand what makes Hadoop tick.
It is important to understand that, out of the gate, Hadoop does not necessarily invent anything that has not existed before in other products. There are some novel concepts and cool innovations in Hadoop, but fundamentally it is about a few key ideas: distributed computing and distributed storage using commodity hardware. Ultimately, Hadoop is about growing your data storage and processing in an incremental and economical way using largely open source technology and off-the-shelf hardware. Note that open source does not mean free, of course.

Okay, so what problem do we want to solve with Hadoop? Please don't say all of them.
One of the nice things about Hadoop is that organizations of any size can adopt it. You can be a small startup with a simple idea and run Hadoop on a small cluster on Amazon, or you can be a larger enterprise with massive clusters performing high-end processing, such as crawling and indexing the entire web. Hadoop can be used in a variety of situations, from reliably storing large volumes of data on commodity storage to much more complex computing, ETL, NoSQL and analytical processing.

For larger organizations that are getting started with Big Data, it is vital to identify some key problems you want solved with Hadoop and that might fit and integrate well with existing legacy systems. Hadoop is particularly good at being a holding area for unstructured data like web or user logs that you might want to keep in raw format for later analysis and auditing, for example. What is typically important is to start small and solve some specific problems on specific data sets and then expand your application of Hadoop as you go. This includes getting accustomed to the many programming and DSL packages that can be used to process Hadoop data.

Hey, in a Big Data universe we never throw anything away.
Some of the talk circling around Big Data often mentions how the typical application of Hadoop is to always store everything forever. Obviously this is not practical. Now, many vendors that are providing software and hardware for Hadoop would love for you to try to do this, but the reality is that you still need to understand your data limits and have clear aging and time-to-live policies. Hadoop does let you scale your storage out to petabytes, potentially, but there is no free lunch here. Also, a critical aspect of this is understanding the format you store your data in within Hadoop. Here again, you hear a lot of talk about storing all your data in "raw format" so you can have all the details in order to extract deep information from your data in the future. While this sounds great in theory, it is not practical in most cases. In reality, you can keep some data in raw format, but you must typically transform your Hadoop data into other formats besides just unstructured HDFS sequence files, for example. Structure does matter as you get into more complex analytics in Hadoop. Storing your data in HDFS also often means transforming it into semi-structured column stores for use by tools such as Hive and HBase and other query engines, for better performance. So structure matters, and you should expect to have your data stored in Hadoop in possibly multiple formats, or at least transformed via Hadoop-based ETL into formats other than the "raw" acquisition format. This all adds up to more and more storage requirements. So make sure you understand the math to properly size your Hadoop storage needs.
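As a rough illustration of that sizing math, the Python sketch below multiplies daily ingest by the retention period, by a factor for the extra formats you keep, and by the HDFS replication factor. The function name, the format multiplier and the headroom figure are assumptions made for this example; the only fixed reference point is that HDFS replicates each block three times by default.

```python
def raw_storage_tb(daily_ingest_gb, retention_days, format_multiplier=1.0,
                   replication=3, headroom=0.25):
    """Estimate the raw disk (in TB) needed for one data set.

    format_multiplier: size of everything stored (raw copy plus derived
        formats) relative to the raw input. For example, 1.5 means the raw
        copy plus derived copies totalling half its size. Assumed value.
    replication: HDFS block replication factor (3 is the HDFS default).
    headroom: fraction of disk kept free for temporary and intermediate
        data. Assumed value; tune it for your own workloads.
    """
    logical_gb = daily_ingest_gb * retention_days * format_multiplier
    replicated_gb = logical_gb * replication
    return replicated_gb / (1.0 - headroom) / 1000.0


# Hypothetical workload: 50 GB/day of logs, stored raw plus one derived format.
keep_one_year = raw_storage_tb(50, 365, format_multiplier=1.5)
keep_90_days = raw_storage_tb(50, 90, format_multiplier=1.5)

print(f"365-day retention: {keep_one_year:.1f} TB of raw disk")
print(f"90-day retention:  {keep_90_days:.1f} TB of raw disk")
```

Comparing the two retention settings shows why an explicit time-to-live policy matters: the retention period scales the whole estimate linearly, before replication and extra formats multiply it further.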

Now this software is open source which means mostly free, right?
Obviously we have all learned by now that open source does not necessarily mean free. Red Hat, as an example, has a pretty good business around open source and is quite successful at making a profit. Hadoop vendors are no different. There are several well funded start-ups that have Red Hat like business models around Hadoop, not to mention all the big boys trying to retrofit their existing Big Data solutions to be Hadoop friendly. None of them are free, but they are all different from each other. It is important to understand each Hadoop vendor's strengths and weaknesses and where they are coming from. The vendor's history does matter, for a lot of reasons that I will discuss in a later post.

Now, in theory you could go it alone and use Hadoop completely free - just download most of the Hadoop packages from Apache (and a few other places). For example, I have downloaded and installed versions of Hadoop from the Apache Foundation and have been able to run basic Map Reduce and HDFS jobs on small clusters - all for free and without going through any Hadoop vendors. You can also use community versions of the various Hadoop distributions from the major Hadoop vendors. This can work, but you are on your own, and how feasible this approach is depends on who you are and how savvy your technical staff are. It is also important to understand how the various Hadoop distributions and players differ from each other and how much you are getting "locked in" with each Hadoop vendor. The retro-fitted Hadoop vendors (as I call them) have a lot more polish and savvy when they pitch Hadoop to you, while some of the Hadoop startup vendors have varying degrees of proprietary software embedded in their Hadoop distributions. It is critical to understand these facts, and it is important to consider how much you are willing to build on top of Hadoop yourself vs relying 100% on your Hadoop partner. These are important considerations that can sometimes get lost in internal management jockeying over who will be the Big Data boss. Vendor lock-in is very important to understand, along with clearly planning for sizing, capacity and long-term incremental growth of your cluster.

This all leads to understanding the cost of Hadoop as you set expectations over what problems you want your Hadoop cluster to solve from day one. Sizing your Hadoop cluster for storage, batch computing, real-time analytics/streaming, and data warehousing must be considered. How you plan capacity for storage, HDD spindles, and CPU cores is a critical decision as you work out the nuts and bolts of your Hadoop cluster. Your Hadoop partner/vendor can help you with this sizing and planning, but here again, each vendor will approach it differently depending on who they are and who you are (how deep your pockets are). You have to be smart here and know what is in your best interest long-term.
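One way to reason about the nuts and bolts is to turn a storage target into a node count and then look at the spindle, core and cost ratios that fall out of it. The Python sketch below does that under loudly stated assumptions: the hardware figures (12 disks of 2 TB, 16 cores, and a per-node price) are hypothetical placeholders, not vendor guidance, and the helper names are invented for this example.

```python
import math


def nodes_for_storage(raw_tb_needed, disks_per_node, tb_per_disk):
    """Smallest node count whose combined disks cover raw_tb_needed."""
    return math.ceil(raw_tb_needed / (disks_per_node * tb_per_disk))


def cluster_profile(nodes, disks_per_node, tb_per_disk, cores_per_node,
                    cost_per_node):
    """Summarize the storage, spindle, core and cost ratios of a cluster."""
    raw_tb = nodes * disks_per_node * tb_per_disk
    cores = nodes * cores_per_node
    return {
        "raw_tb": raw_tb,
        "spindles": nodes * disks_per_node,
        "cores": cores,
        "tb_per_core": raw_tb / cores,
        "cost_per_raw_tb": nodes * cost_per_node / raw_tb,
    }


# Hypothetical hardware: 12 x 2 TB disks and 16 cores per node at 6000 per node.
nodes = nodes_for_storage(110, disks_per_node=12, tb_per_disk=2)
profile = cluster_profile(nodes, disks_per_node=12, tb_per_disk=2,
                          cores_per_node=16, cost_per_node=6000)

print(f"nodes needed: {nodes}")
for key, value in profile.items():
    print(f"{key}: {value}")
```

A storage-heavy archive and a compute-heavy analytics cluster will want very different TB-per-core ratios, so it is worth rerunning this kind of calculation for each candidate hardware profile a vendor proposes.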

Your Hadoop cluster is not an island.
It is vital to consider how your Hadoop cluster will fit in with your current IT environment and existing data warehousing and BI environments. Hadoop will typically not totally replace your existing ETL, data warehousing and BI systems. In many cases, it will live alongside existing BI systems. It is also vital to understand how you will be moving data efficiently into your Hadoop cluster and how much processing and storage is needed to put data into intermediate formats for optimal performance and efficient consumption by applications. These are critical questions to answer in order to get your Hadoop cluster running efficiently to effectively feed downstream systems.

You mean my Hadoop cluster does not run itself?
One underestimated area concerning Hadoop is planning for the operations and ongoing management of your Hadoop cluster. Hadoop is good technology, but it is fast evolving and has many moving parts, both at an infrastructure level (lots of nodes and HDDs) and from a software perspective (lots of software packages that are fast evolving). This makes running, monitoring and upgrading/patching Hadoop a non-trivial task. For example, many of the Hadoop vendors offer both open source and proprietary solutions for managing and running your clusters. This obviously requires your operations and production IT staff to be included in the planning and management of your clusters.

Some other important questions and considerations as you get started with Hadoop.
  • How will multi-tenancy and sharing work if more than one group is going to be using your cluster?
  • Should you have one or a few big Hadoop clusters, or many small clusters?
  • Understand your storage, processing, and concurrency needs. Not all Hadoop schedulers are created equal for all situations.
  • Do you need or want to leverage virtualization and/or cloud bursting?
  • Choose your hardware carefully to keep costs per TB low. How you manage TB vs CPU/core is important.
  • Understand what you need in your edge nodes for utility and add-on software.
  • Plan your data acquisition and export needs between your Hadoop cluster and the rest of your ecosystem.
  • Understand your security needs at a data and functional level.
  • What are your uptime requirements? Plan for rolling patches and upgrades.

Maybe I should have stated this in the beginning, but the reason I called this post Protecting your Hadoop Investment is that many organizations enter into this undertaking without a clear understanding of:
  1. Why they are pursuing Big Data (other than it is the hot thing to do).
  2. How Hadoop differs from past proprietary Big Data solutions.
  3. How it can fit alongside existing legacy systems.
  4. How to ultimately manage costs and expectations at both a management and technical level.
If you do not understand these points, then you will waste a lot of time and money and fail to take effective advantage of Hadoop. So, strap in and enjoy your Hadoop and Big Data adventure. It will be a journey as much as a destination and it will transform your organization for the better if you plan appropriately and enter into it with your eyes wide open.

Thursday, July 25, 2013

JobServer 3.4.28 - Isolated JVM Containers

We are happy to announce the release of JobServer 3.4.28 which adds a number of new features for administrators along with supporting the latest version of Google Web Toolkit and expanded remote management APIs.

With this release, JobServer now supports expanded remote web services programmatic APIs. Also included in this release is the capability to run distributed jobs under customizable Linux/Unix userspace accounts on a job-by-job basis, which gives administrators fine-grained control over how they run their jobs. This allows users to run jobs inside isolated JVMs in a more granular fashion.

It has always been our focus to make JobServer the most developer and IT friendly scheduling and job processing platform on the planet. We are proud of our focus on taking customer and developer feedback to continuously make JobServer the best scheduling and job processing engine around. JobServer tames your job processing and scheduling environment in a way that is a joy for Java developers to customize, while providing powerful web UI management and administration features for business users and IT operations administrators.

Download and test drive JobServer 3.4.28 today and learn more about JobServer's powerful developer SDK, soafaces, which makes it easier to extend and customize JobServer and to develop custom jobs and backend automated services.


Saturday, June 1, 2013

Big Data is More Than Correlation and Causality

There is no discounting that the Big Data movement is getting a lot of attention from all avenues of business and technology. Large scale computing has been around for decades, since the days of supercomputers, and has been brought to the forefront of late by the high-flying internet companies. This has been driven in part by significant advances in the availability of commodity hardware, open source distributed computing software, cloud computing, and virtualization, among other things.

A lot of the debate as to the value and benefits of Big Data is centered around how it can help companies analyze large data sets to make marketing type decisions, such as recommending what movie or product you should buy, and thus improve the bottom line of these businesses. There are also other applications, such as the analysis of vast volumes of sensor or transactional data in order to find patterns using machine learning. The possibilities for applying Big Data abound, for analyzing both structured and unstructured data in order to extract information and improve marketing and overall business decision making.

Correlation vs Causality
One common debate about Big Data concerns the effectiveness of the analytics applied in Big Data solutions: can it really discover answers to questions, or is it better suited to finding correlations than to identifying precise causality? These are good discussions to have, and in general I think Big Data can serve many purposes, from finding correlations to solving very specific problems from a wide spectrum of data sources. The ability to extract value from Big Data is driven in part by the volume of data available and by applying the right machine learning algorithms. However, I believe there is a much bigger value to be gained from the Big Data computing movement than just correlations, sifting through transactions to calculate some metric, or finding a needle in a haystack from petabytes of data.

Insights are not Enough
Extracting insights from vast volumes of structured and loosely structured data has many applications, but the ultimate application of this is enabling computing systems to make smart and intelligent decisions with less and less human involvement. This is what leads to lower costs and improved productivity and what has historically been part of the human evolution where it relates to technology. We have evolved over the decades to have machines do more work for us, so the smarter our machines get and the more autonomous they get the more we evolve as a technology driven society.

Automation and Intelligence
Ultimately Big Data can help us go beyond just a discussion around finding correlations or summarizing metrics to generate visually captivating reports. The ultimate benefit business can gain from Big Data is no different from what it has always been in the past with other computing and communications technology advances. It is about automation in its simplest form, and in the most advanced form it is about enabling software and computers to power artificial intelligence to enable system autonomy. The smarter and more independent our systems are, the more we advance and the more efficient business becomes. This drives greater productivity and effectiveness in all aspects of business. This will, for example, allow us to build power plants that run themselves much more efficiently, to build computers like IBM Watson that can make human-like decisions, and to build automation software like Siri and Google Now that can understand what we want and deliver the right information to us at the exact time we need it. So Big Data is many things, but ultimately it will turn our computers and data into information that will automate all aspects of our lives and make business more efficient and productive.

The Time for Artificial Intelligence is Now
With advances in distributed computing, networking, and storage, the time has come for AI to be at the heart of what Big Data is all about. AI has never achieved many of the sci-fi type capabilities we have all grown up watching on TV and in movies. Big Data will be what allows AI to achieve its full potential, and this will make possible many things we have only dreamt of.


Tuesday, May 14, 2013

Hadoop the New 'T' in ETL

ETL is a common computing paradigm used in a variety of data movement and data management scenarios. As demand for more insight into business data has grown, ETL has been used to move more data from operational data stores into OLAP and data warehousing environments. This has expanded the need for analytics and other solutions that rely on data being reconstituted into forms that are easier to consume or data models that solve specific problems more efficiently.

So nothing special going on here, but as data volumes have grown and sources of data have exploded, the transformation part of ETL (the "T") is becoming more of a challenge, especially as organizations demand more near real-time analytics and up to date information. Transforming the volumes of operational data is becoming a computing bottleneck and often limits what you can do with data after it has been transformed and loaded into downstream data marts. See a typical ETL data flow diagram below.

Big Data to the Rescue
With the evolution of big data and Hadoop, new tools have been brought to bear that can help in the overall ETL computing process. However, with Hadoop, the ETL model needs to be revisited. Hadoop can bring tremendous computing resources to more efficiently transform data into target models. While Hadoop can serve as part of your overall processing fabric, can be leveraged directly for OLAP, and can itself be used for data warehousing (e.g. HBase data store), it can also serve as an intermediate staging area that can be used to populate traditional relational data marts.

Using Hadoop in this way allows it to be used as an intermediate store for data until it can later be transformed into target models. We can accomplish this "load first" approach with Hadoop by changing the ETL model around a bit. Instead of extracting and transforming data first, we can extract and load data into Hadoop storage for staging, and then take full advantage of the Hadoop compute infrastructure to transform the data (using Map Reduce, Impala, Drill, etc.) into target models that can feed traditional relational data marts and OLAP engines. See diagram for example:

Hadoop for Transformation
This essentially allows organizations to use Hadoop as the transformation platform, letting developers perform more complex transformations that were not practical in the normal ETL universe. So think of Hadoop as the new super charged "T" in the "ELT" paradigm, where data is moved as efficiently as possible from operational stores and loaded ("L") into HDFS (and HBase or Cassandra) as fast as possible. Then the "T" can be performed within the Hadoop ecosystem. This allows Hadoop to be a powerful intermediary layer that can drive new analytics and allow existing analytics to keep up with the deluge of data. This also allows existing OLAP and data warehouses to continue to consume data out of Hadoop for existing analytics.
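The ordering of the steps is the whole point of "ELT", so a toy sketch may help. The Python below is only an illustration of the pattern: a local temporary directory stands in for HDFS, an in-memory list stands in for the operational store, and a plain Python aggregation stands in for whatever engine (Map Reduce, Hive, Impala and so on) would do the transform on a real cluster. The record fields, file name and function names are all hypothetical.

```python
import csv
import io
import json
import tempfile
from collections import defaultdict
from pathlib import Path


def extract_and_load(records, staging_dir):
    """E + L: land source records in staging unchanged, one JSON object per line."""
    path = Path(staging_dir) / "orders_raw.jsonl"
    with path.open("w", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")
    return path


def transform(raw_path):
    """T: aggregate the staged raw data into a target model (total per customer)."""
    totals = defaultdict(float)
    with raw_path.open(encoding="utf-8") as f:
        for line in f:
            rec = json.loads(line)
            totals[rec["customer"]] += rec["amount"]
    return dict(sorted(totals.items()))


def to_mart_csv(totals):
    """Render the target model as CSV, ready to bulk-load into a relational data mart."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["customer", "total_amount"])
    for customer, total in totals.items():
        writer.writerow([customer, f"{total:.2f}"])
    return buf.getvalue()


# Stand-in for rows pulled from an operational store (made-up data).
SOURCE = [
    {"customer": "acme", "amount": 10.0},
    {"customer": "zeta", "amount": 5.5},
    {"customer": "acme", "amount": 2.5},
]

with tempfile.TemporaryDirectory() as staging:
    staged = extract_and_load(SOURCE, staging)   # load first, untouched
    totals = transform(staged)                   # transform later, inside "Hadoop"
    mart_csv = to_mart_csv(totals)               # feed the downstream data mart

print(mart_csv)
```

Because the raw records are landed untouched, the transform can be rerun or replaced later without going back to the operational store, which is the main practical advantage of loading before transforming.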

So let us start getting used to the concept of "ELT" as the new big data cousin of ETL. Hadoop is more than just a historical archive or dumping ground for unstructured data. It can be a powerful transform computing layer that can drive better data warehousing for new and existing analytics solutions.
