Sunday, October 20, 2013

JobServer and Mesos Make a Great Pair

We are happy to announce JobServer 3.6 beta1, an early access release with integrated support for Mesos clustering and distributed job processing. With this release, you can schedule and run jobs on a Mesos cluster of any size and configuration. Say goodbye to cron jobs!

JobServer has always supported distributed job scheduling and processing and has long been a great replacement for cron. Now, with Mesos integration, JobServer takes this to the next level, adding dynamic resource management and improved reliability by leveraging the advantages of Mesos. JobServer also brings powerful scheduling, reporting and monitoring features to Mesos environments. Distributed job scheduling and batch processing just got more interesting!

With this release you can track and manage jobs as they run across a dynamic and highly resilient cluster of servers. JobServer with Mesos lets you run scripts and jobs across your cluster and control how resources are allocated and utilized. If you are a Mesos user today, give JobServer a try and say goodbye to cron. If you are a JobServer user, get your compute resources under control with Mesos.

Download the beta release of JobServer v3.6 and tame your IT environment using all the advantages of Mesos and JobServer.

Grand Logic delivers software solutions that automate your business processes and tame your IT operations & Big Data analytics. Grand Logic delivers data and job automation software and Hadoop consulting services that maximize your Big Data investment.

Friday, October 18, 2013

Mesos: The Datacenter is the Computer

The data center is the computer. The pendulum is swinging. Traditional cloud and virtualization level resource management in the data center is no longer good enough to efficiently manage the growing demand for computing services in the enterprise. The answer to this challenge, offering more compute and storage services more efficiently, lies in solutions such as Mesos and YARN. These emerging cluster management platforms are the next evolution in fine grained, efficient resource management of your data center infrastructure and services. As the need for more processing and storage grows, solutions like YARN and Mesos take center stage.

Big Data, mobile, and cloud computing have driven a tremendous amount of growth and innovation, but the byproduct has been more and more computing infrastructure needed to service that growth and manage the explosion of data. This has especially been the case as we have moved to clustered commodity hardware and distributed storage. Start-ups and smaller companies now manage complex multi-node computing infrastructure for things like Hadoop, real-time event streaming, and social graphs, as well as for established core services like data warehousing, ETL and batch processing. All this puts heavy demands on organizations to effectively manage and administer a dynamic hardware environment, and in many cases it has created isolated silos of resources dedicated to different tasks; for example, your Hadoop cluster is separate from your application services, database servers and legacy batch processing. This does not scale, and it is not cost effective.

These silos create inefficiencies within the data center and the enterprise environment. For example, if your Hadoop cluster of 10 nodes is running at maximum capacity only 70% of the time, what are those 10 nodes doing the other 30% of the time? The same can be said for the other services running in the data center. Unless you can treat your entire data center as one shared cluster of resources, you will have inefficiencies, and as the number of nodes and services you manage grows, these inefficiencies will only increase. This is where solutions like Mesos step in and give your applications and services one holistic view of your computing infrastructure. By using Mesos, you can reduce costs, more efficiently utilize the hardware and storage resources you already have, and grow more incrementally as more resources are needed.
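To make the utilization point concrete, here is a small back-of-the-envelope sketch in Java. Every number in it (the node counts, the utilization figures, the 85% shared-pool target) is hypothetical and only illustrates the shape of the calculation, not any real environment.

  // Back-of-the-envelope sketch (hypothetical numbers) comparing idle capacity
  // in siloed clusters versus one shared, Mesos-managed pool.
  public class SiloWaste {

    public static void main(String[] args) {
      // Hypothetical silos: {nodes, average utilization}
      double[][] silos = {
        {10, 0.70},  // Hadoop cluster
        {8,  0.50},  // application services
        {6,  0.40}   // legacy batch processing
      };

      double totalNodes = 0;
      double idleNodeHoursPerDay = 0;
      for (double[] silo : silos) {
        totalNodes += silo[0];
        idleNodeHoursPerDay += silo[0] * (1.0 - silo[1]) * 24;
      }

      System.out.printf("Total nodes across silos: %.0f%n", totalNodes);
      System.out.printf("Idle node-hours per day:  %.0f%n", idleNodeHoursPerDay);

      // If the same work ran on one shared pool at, say, 85% utilization,
      // roughly how many nodes would be needed?
      double usedNodeHours = totalNodes * 24 - idleNodeHoursPerDay;
      double sharedPoolNodes = Math.ceil(usedNodeHours / (24 * 0.85));
      System.out.printf("Nodes needed as a shared pool at 85%% utilization: %.0f%n", sharedPoolNodes);
    }
  }

With these made-up inputs, 24 siloed nodes carrying the same load would fit on roughly 16 shared nodes, which is the kind of consolidation a shared resource manager like Mesos is meant to unlock.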

Companies like Google, Twitter and Facebook are leading the charge to advance the state of the art in efficient data center and enterprise computing. Mesos is a great platform to leverage to reduce costs and improve the reliability and overall efficiency of your IT environment. Give Mesos a look. Cheers!

Thursday, August 15, 2013

Protecting your Hadoop Investment

The hype and buzz around Big Data in the tech industry is at astronomical levels. There are many factors driving this (both technical and human), but I won't get into that here. The fact is that Big Data (define it as you like) is here to stay, and many organizations need to find their path to the wonderland of bottomless data storage and boundless analytical computing, where no byte of data is ever thrown away and any question about your data can be asked and answered. Well, at least if you are Facebook or Google.

Hadoop is the leading contender to enable organizations to economically and incrementally take advantage of distributed storage and scalable distributed processing to tackle the Big Data challenges ahead. The days of buying expensive vertically scaling servers and expensive storage systems are over. Hadoop started from the humble beginnings of Map Reduce and distributed storage (HDFS), and it has now expanded to touch and integrate with all corners of the enterprise computing fabric, from real-time business intelligence to ETL and data warehousing. These days, most any company with some kind of database or analytics solution has put the word "Big" in their title and offers some level of Hadoop integration. There is nothing really wrong with that, and it is great to see everyone gravitating to the Hadoop ecosystem as an open source standard of sorts for Big Data.

Hadoop presents a lot of potential to solve problems that in the past required much more expensive and proprietary systems. Note that Hadoop in many respects is no less complex (and is by no means free) than past and existing proprietary Big Data platforms; Hadoop has its own complexity challenges, with many distributed hardware moving parts, and it is more or less a loose collection of many open source projects. Hadoop has a lot of creative minds and companies driving its fast evolution. But it is not an out-of-the-box plug and play solution, nor a one size fits all solution by any stretch of the imagination. Hadoop does not come cheap by any measure, but with Hadoop you have more opportunity to grow your Big Data system as you go, with the potential for less vendor lock-in and more flexibility over what you pay for (note, I use the word potential here). The value you get out of Hadoop depends on your expectations and on your investment in people and training, along with key decisions you make along the way.

So how does an organization begin down the road of figuring out how Hadoop fits into their existing ecosystem and how much and how fast to invest in Hadoop? Let's see if we can walk through some common questions, challenges and experiences one would go through as they begin their Hadoop quest.

First you need to understand what makes Hadoop tick.
It is important to understand that, out of the gate, Hadoop does not necessarily invent anything that has not existed before in other products. There are some novel concepts and cool innovations in Hadoop, but fundamentally it is about a few key ideas: distributed computing and distributed storage on commodity hardware. Ultimately Hadoop is about growing your data storage and processing in an incremental and economical way using largely open source technology and off-the-shelf hardware. Note, open source does not mean free, of course.
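To ground those ideas, here is a minimal word-count sketch written against the Hadoop 2.x MapReduce Java API (the input and output paths are supplied as command-line arguments and are assumptions of the example). The map tasks run in parallel on the nodes that hold the HDFS blocks, and the reduce tasks aggregate the partial counts; that division of labor across commodity machines is the essence of what Hadoop offers.

  import java.io.IOException;
  import java.util.StringTokenizer;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.Reducer;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
  import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

  public class WordCount {

    // Each map task tokenizes its slice of the input and emits (word, 1) pairs.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
      private final static IntWritable one = new IntWritable(1);
      private final Text word = new Text();

      public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
          word.set(itr.nextToken());
          context.write(word, one);
        }
      }
    }

    // Each reduce task sums the counts for the words routed to it.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
      private final IntWritable result = new IntWritable();

      public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
          sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
      }
    }

    public static void main(String[] args) throws Exception {
      Job job = Job.getInstance(new Configuration(), "word count");
      job.setJarByClass(WordCount.class);
      job.setMapperClass(TokenizerMapper.class);
      job.setCombinerClass(IntSumReducer.class);
      job.setReducerClass(IntSumReducer.class);
      job.setOutputKeyClass(Text.class);
      job.setOutputValueClass(IntWritable.class);
      FileInputFormat.addInputPath(job, new Path(args[0]));    // input already loaded into HDFS
      FileOutputFormat.setOutputPath(job, new Path(args[1]));  // results written back to HDFS
      System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
  }

Packaged into a jar, a job like this would typically be submitted with the hadoop jar command against data already copied into HDFS; the same code runs unchanged whether the cluster has three nodes or three hundred.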

Okay, so what problem do we want to solve with Hadoop? Please don't say all of them.
One of the nice things about Hadoop is that organizations of any size can adopt it. You can be a small startup with a simple idea running Hadoop on a small cluster on Amazon, or you can be a large enterprise with massive clusters performing high-end processing, such as crawling and indexing the entire web. Hadoop can be used in a variety of situations: to reliably store large volumes of data on commodity storage, or for much more complex computing, ETL, NoSQL and analytical processing.

For larger organizations that are getting started with Big Data, it is vital to identify some key problems you want solved with Hadoop and that might fit and integrate well with existing legacy systems. Hadoop is particularly good at being a holding area for unstructured data like web or user logs that you might want to keep in raw format for later analysis and auditing, for example. What is typically important is to start small and solve some specific problems on specific data sets and then expand your application of Hadoop as you go. This includes getting accustomed to the many programming and DSL packages that can be used to process Hadoop data.

Hey, in a Big Data universe we never throw anything away.
Some of the talk circling around Big Data often suggests that the typical application of Hadoop is to store everything forever. Obviously this is not practical. Many vendors providing software and hardware for Hadoop would love for you to try to do this, but the reality is that you still need to understand your data limits and have clear aging and time-to-live policies. Hadoop does let you scale your storage out to petabytes, potentially, but there is no free lunch here. A critical aspect of this is understanding the format you store your data in within Hadoop. Here again, you hear a lot of talk about storing all your data in "raw format" so you retain all the details needed to extract deep information from your data in the future. While this sounds great in theory, it is not practical in most cases. In reality, you can keep some data in raw format, but you must typically transform your Hadoop data into other formats besides just unstructured HDFS sequence files. Structure does matter as you get into more complex analytics in Hadoop. Storing your data in HDFS often also means transforming it into semi-structured column stores for use by tools such as Hive, HBase and other query engines, for better performance. So expect to have your data stored in Hadoop in multiple formats, or at least transformed via Hadoop-based ETL into formats other than the "raw" acquisition format. This all adds up to more and more storage, so make sure you understand the math to properly size your Hadoop storage needs.
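As a rough illustration of that math, here is a small sizing sketch. Every input (daily volume, retention, derived-format factor, replication factor, scratch overhead, per-node capacity) is a hypothetical placeholder to be swapped for your own figures; the point is only how quickly raw ingest multiplies into physical capacity.

  // Rough HDFS capacity sizing with hypothetical inputs; adjust every constant
  // to your own environment before drawing any conclusions.
  public class HadoopSizingSketch {

    public static void main(String[] args) {
      double rawTbPerDay = 1.0;        // raw data landed per day (hypothetical)
      int retentionDays = 365;         // how long raw data is kept
      double derivedFactor = 1.0;      // extra copy in columnar/derived formats (~1x raw)
      int replicationFactor = 3;       // default HDFS block replication
      double tempOverhead = 0.25;      // scratch space for shuffle, staging, etc.
      double usableTbPerNode = 24.0;   // usable disk per data node (hypothetical)

      double logicalTb = rawTbPerDay * retentionDays * (1.0 + derivedFactor);
      double physicalTb = logicalTb * replicationFactor * (1.0 + tempOverhead);
      int dataNodes = (int) Math.ceil(physicalTb / usableTbPerNode);

      System.out.printf("Logical data (raw + derived): %.0f TB%n", logicalTb);
      System.out.printf("Physical HDFS capacity needed: %.0f TB%n", physicalTb);
      System.out.printf("Approximate data nodes at %.0f TB each: %d%n", usableTbPerNode, dataNodes);
    }
  }

With these made-up numbers, one terabyte of raw data per day turns into well over two petabytes of physical capacity after a year, which is exactly why aging policies and format choices need to be decided up front.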

Now this software is open source which means mostly free, right?
Obviously we have all learned by now that open source does not necessarily mean free. Red Hat, as an example, has built a pretty good business around open source and is quite successful at making a profit. Hadoop vendors are no different. There are several well funded start-ups with Red Hat-like business models around Hadoop, not to mention all the big boys trying to retrofit their existing Big Data solutions to be Hadoop friendly. None of them are free, and they all differ from each other. It is important to understand each Hadoop vendor's strengths and weaknesses and where they are coming from. The vendor's history matters for a lot of reasons that I will discuss in a later post.

Now, in theory you could go it alone and use Hadoop completely free - just download most of the Hadoop packages from Apache (and a few other places). For example, I have downloaded and installed versions of Hadoop from the Apache Foundation and have been able to run basic Map Reduce and HDFS jobs on small clusters - all for free and without going through any Hadoop vendors. You can also use the community versions of the various Hadoop distributions from the major Hadoop vendors. This can work, but you are on your own, and how feasible this approach is depends on who you are and how savvy your technical staff are. It is also important to understand how the various Hadoop distributions and players differ from each other and how much you are getting "locked in" with each Hadoop vendor. The retro-fitted Hadoop vendors (as I call them) have a lot more polish and savvy when they pitch Hadoop to you, while some of the Hadoop startup vendors have varying degrees of proprietary software embedded in their Hadoop distributions. It is critical to understand these facts and to consider how much you are willing to build on top of Hadoop yourself versus relying 100% on your Hadoop partner. These are important considerations that can sometimes get lost in internal management jockeying over who will be the Big Data boss. Vendor lock-in is very important to understand, along with clearly planning for sizing, capacity and long-term incremental growth of your cluster.

This all leads to understanding the cost of Hadoop as you set expectations for what problems you want your Hadoop cluster to solve from day one. Sizing your Hadoop cluster for storage, batch computing, real-time analytics/streaming, and data warehousing must be considered. How you capacity plan your storage, HDD spindles, and CPU cores is a critical decision as you plan the nuts and bolts of your Hadoop cluster. Your Hadoop partner/vendor can help you with this sizing and planning, but here again each vendor will approach it differently depending on who they are and who you are (how deep your pockets are). You have to be smart here and know what is in your best interest long-term.

Your Hadoop cluster is not an island.
It is vital to consider how your Hadoop cluster will fit in with your current IT environment and existing data warehousing and BI environments. Hadoop will typically not totally replace your existing ETL, data warehousing and BI systems. In many cases, it will live alongside existing BI systems. It is also vital to understand how you will be moving data efficiently into your Hadoop cluster and how much processing and storage is needed to put data into intermediate formats for optimal performance and efficient consumption by applications. These are critical questions to answer in order to get your Hadoop cluster running efficiently to effectively feed downstream systems.

You mean my Hadoop cluster does not run itself?
One underestimated area of Hadoop is planning for the operations and ongoing management of your cluster. Hadoop is good technology, but it is fast evolving and has many moving parts, both at the infrastructure level (lots of nodes and HDDs) and from a software perspective (lots of fast-evolving packages). This makes running, monitoring and upgrading/patching Hadoop a non-trivial task. For example, many of the Hadoop vendors offer both open source and proprietary solutions for managing and running your clusters. This obviously requires your operations and production IT staff to be included in the planning and management of your clusters.

Some other important questions and considerations as you get started with Hadoop.
  • How will multi-tenancy and sharing work if more than one group is going to be using your cluster?
  • Should you have one or a few big Hadoop clusters, or many small clusters?
  • Understand your storage, processing, and concurrency needs. Not all Hadoop schedulers are created equal for all situations.
  • Do you need or want to leverage virtualization and/or cloud bursting?
  • Choose your hardware carefully to keep costs per TB low. How you manage TB versus CPU/cores is important.
  • Understand what you need in your edge nodes for utility and add-on software.
  • Plan your data acquisition and export needs between your Hadoop cluster and the rest of your ecosystem.
  • Understand your security needs at a data and functional level.
  • What are your uptime requirements? Plan for rolling patches and upgrades.

Maybe I should have stated this at the beginning, but the reason I called this post Protecting your Hadoop Investment is that many organizations enter into this undertaking without a clear understanding of:
  1. Why they are pursuing Big Data (other than it is the hot thing to do).
  2. How Hadoop differs from past proprietary Big Data solutions.
  3. How it can fit alongside existing legacy systems.
  4. How to manage costs and expectations at both a management and technical level.
If you do not understand these points, you will waste a lot of time and money and fail to take effective advantage of Hadoop. So strap in and enjoy your Hadoop and Big Data adventure. It will be a journey as much as a destination, and it will transform your organization for the better if you plan appropriately and enter into it with your eyes wide open.

Thursday, July 25, 2013

JobServer 3.4.28 - Isolated JVM Containers

We are happy to announce the release of JobServer 3.4.28 which adds a number of new features for administrators along with supporting the latest version of Google Web Toolkit and expanded remote management APIs.

With this release, JobServer supports expanded remote web services programmatic APIs. Also included is the ability to run distributed jobs under customizable Linux/Unix userspace accounts on a job-by-job basis, which gives administrators fine-grained control over how they run their jobs. This allows users to run jobs inside isolated JVMs in a more granular fashion.
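JobServer's own mechanism is configured through its administration tools, but the underlying idea of launching a job in its own JVM under a dedicated userspace account can be sketched in plain Java along these lines. The class names, user account, paths and sudo setup below are hypothetical illustrations, not JobServer internals.

  import java.io.File;
  import java.io.IOException;

  // Illustrative only: spawn a job in its own JVM as a dedicated OS user so a
  // misbehaving job cannot take down the scheduler or interfere with other jobs.
  public class IsolatedJobLauncher {

    public static int runAs(String unixUser, String classpath, String jobClass)
        throws IOException, InterruptedException {
      ProcessBuilder pb = new ProcessBuilder(
          "sudo", "-u", unixUser,          // switch to the job's userspace account
          "java", "-Xmx512m",              // per-job JVM with its own heap limit
          "-cp", classpath,
          jobClass);
      pb.redirectErrorStream(true);
      pb.redirectOutput(new File("/var/log/jobs/" + jobClass + ".log")); // hypothetical log location
      Process p = pb.start();
      return p.waitFor();                  // exit code reported back to the scheduler
    }

    public static void main(String[] args) throws Exception {
      int exit = runAs("etl_user", "/opt/jobs/lib/*", "com.example.NightlyEtlJob");
      System.out.println("Job finished with exit code " + exit);
    }
  }

Running each job in a separate process under its own account is what makes per-job memory limits, file permissions and cleanup enforceable at the operating system level rather than inside a shared JVM.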

It has always been our focus to make JobServer the most developer and IT friendly scheduling and job processing platform on the planet. We are proud of our focus on taking customer and developer feedback to continuously make JobServer the best scheduling and job processing engine around. JobServer tames your job processing and scheduling environment in a way that is a joy for Java developers to build upon, while providing powerful web UI management and administration features for business users and IT operations administrators.

Download and test drive JobServer 3.4.28 today and learn more about JobServer's powerful developer SDK, soafaces, which makes extending and customizing JobServer and developing custom jobs and backend automated services easier.

Grand Logic delivers software solutions that automate your business processes and tame your IT operations & Big Data analytics. Grand Logic delivers data and job automation software and Hadoop consulting services that maximize your Big Data investment.

Saturday, June 1, 2013

Big Data is More Than Correlation and Causality

There is no discounting that the Big Data movement is getting a lot of attention from all avenues of business and technology. Large scale computing has been around for decades, since the days of super computers, and has been brought to the forefront of late by the high flying internet companies. This has been driven in part by significant advances in the availability of commodity hardware, open source distributed computing software, cloud computing, and virtualization among other things.

A lot of the debate about the value and benefits of Big Data centers on how it can help companies analyze large data sets to make marketing-type decisions, such as recommending which movie or product you should buy, and thus improve the bottom line of these businesses. There are also other applications, such as analyzing vast volumes of sensor or transactional data to find patterns using machine learning. The possibilities for applying Big Data abound, both for analyzing structured and unstructured data to extract information and for improving marketing and overall business decision making.

Correlation vs Causality
One common debate about Big Data is the effectiveness of the analytics applied in Big Data solutions: can it really discover answers to specific questions, or is it better suited to finding correlations rather than identifying precise causality? These are good discussions to have, and in general I think Big Data can serve many purposes, from finding correlations to solving very specific problems from a wide spectrum of data sources. The ability to extract value from Big Data is driven in part by the volume of data available and by applying the right machine learning algorithms. However, I believe there is a much bigger value to be gained from the Big Data computing movement than just correlations, sifting through transactions to calculate some metric, or finding a needle in a haystack from petabytes of data.

Insights are not Enough
Extracting insights from vast volumes of structured and loosely structured data has many applications, but the ultimate application is enabling computing systems to make smart and intelligent decisions with less and less human involvement. This is what leads to lower costs and improved productivity, and it is what has historically driven our evolution with technology. We have evolved over the decades to have machines do more work for us, so the smarter and more autonomous our machines get, the more we evolve as a technology driven society.



Automation and Intelligence
Ultimately Big Data can take us beyond discussions of finding correlations or summarizing metrics to generate visually captivating reports. The ultimate benefit business can gain from Big Data is no different from what past computing and communications advances have delivered. In its simplest form it is about automation, and in its most advanced form it is about enabling software and computers to power artificial intelligence and system autonomy. The smarter and more independent our systems are, the more we advance and the more efficient business becomes. This drives greater productivity and effectiveness in all aspects of business. It will, for example, allow us to build power plants that run themselves much more efficiently, to build computers like IBM Watson that can make human-like decisions, and to create automation software like Siri and Google Now that understands what we want and delivers the right information at the exact time we need it. So Big Data is many things, but ultimately it will turn our computers and data into information that will automate all aspects of our lives and make business more efficient and productive.

The Time for Artificial Intelligence is Now
With advances in distributed computing, networking, and storage, the time has come for AI to be at the heart of what Big Data is all about. AI has never achieved many of the sci-fi capabilities we have all grown up watching on TV and in movies. Big Data will be what allows AI to reach its full potential, and this will make many things we only dreamt of possible.

Grand Logic delivers software solutions that automate your business processes and tame your IT operations & Big Data analytics. Grand Logic delivers data and job automation software and Hadoop consulting services that maximize your Big Data investment.

Tuesday, May 14, 2013

Hadoop the New 'T' in ETL

ETL is a common computing paradigm used in a variety of data movement and data management scenarios. As demand for more insight into business data has grown, ETL has been used to move more data from operational data stores into OLAP and data warehousing environments. This has expanded the need for analytics and other solutions that rely on data being reconstituted into easier-to-consume forms or into data models better suited to solving specific problems.



So nothing special going on here, but as data volumes have grown and sources of data have exploded, the transformation part of ETL (the "T") is becoming more of a challenge, especially as organizations demand more near real-time analytics and up to date information. Transforming the volumes of operational data is becoming a computing bottleneck and often limits what you can do with data after it has been transformed and loaded into downstream data marts. See a typical ETL data flow diagram below.



Big Data to the Rescue
With the evolution of Big Data and Hadoop, new tools have been brought to bear that can help with the overall ETL computing process. However, with Hadoop, the ETL model needs to be revisited. Hadoop can bring tremendous computing resources to more efficiently transform data into target models. While Hadoop can serve as part of your overall processing fabric and can be leveraged directly for OLAP and even data warehousing (e.g. the HBase data store), it can also serve as an intermediate staging area used to populate traditional relational data marts.

Using Hadoop in this way allows it to be used as an intermediate store for data until it can later be transformed into target models. We can accomplish this "load first" approach using Hadoop, by changing the ETL model around a bit. Instead of extracting and transforming data first, we can instead extract and load data into Hadoop storage, for staging, and then take full advantage of the Hadoop compute infrastructure to transform (using Map Reduce, Impala, Drill…etc) the data into target models that can feed traditional relational data marts and OLAP engines. See diagram for example:



Hadoop for Transformation
This essentially lets organizations use Hadoop as the transformation platform, allowing developers to perform more complex transformations that were not practical in the normal ETL universe. So think of Hadoop as the new super-charged "T" in the "ELT" paradigm, where data is moved as efficiently as possible from operational stores and loaded ("L") into HDFS (and HBase or Cassandra), and the "T" is then performed within the Hadoop ecosystem. This allows Hadoop to be a powerful intermediary layer that can drive new analytics and allow existing analytics to keep up with the deluge of data. It also allows existing OLAP engines and data warehouses to continue to consume data out of Hadoop.
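As a sketch of what that "T inside Hadoop" can look like, here is a map-only MapReduce job written against the Hadoop 2.x Java API. The raw format (comma-separated timestamp, user, url), the paths, and the cleanup rules are hypothetical; the pattern is reshaping raw lines staged in HDFS into tab-delimited records ready to be bulk-loaded into a downstream data mart.

  import java.io.IOException;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.NullWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
  import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

  // ELT-style transform: raw lines in, cleaned tab-delimited records out.
  public class RawLogTransform {

    public static class TransformMapper extends Mapper<Object, Text, Text, NullWritable> {
      private final Text out = new Text();

      @Override
      protected void map(Object key, Text value, Context context) throws IOException, InterruptedException {
        // Hypothetical raw format: timestamp,user,url
        String[] fields = value.toString().split(",");
        if (fields.length < 3) {
          return; // skip malformed lines instead of failing the whole job
        }
        out.set(fields[0].trim() + "\t" + fields[1].trim().toLowerCase() + "\t" + fields[2].trim());
        context.write(out, NullWritable.get());
      }
    }

    public static void main(String[] args) throws Exception {
      Job job = Job.getInstance(new Configuration(), "raw log transform");
      job.setJarByClass(RawLogTransform.class);
      job.setMapperClass(TransformMapper.class);
      job.setNumReduceTasks(0); // map-only: the transform scales with the number of input splits
      job.setOutputKeyClass(Text.class);
      job.setOutputValueClass(NullWritable.class);
      FileInputFormat.addInputPath(job, new Path(args[0]));    // staged raw data in HDFS
      FileOutputFormat.setOutputPath(job, new Path(args[1]));  // transformed output, ready for the data mart
      System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
  }

Because the transform runs where the data already lives, the heavy lifting stays on the cheap, horizontally scaled cluster, and only the finished, structured output needs to move to the relational side.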

So let us start getting used to the concept of "ELT" as the new big data cousin of ETL. Hadoop is more than just a historical archive or dumping ground for unstructured data. It can be a powerful transform computing layer that can drive better data warehousing for new and existing analytics solutions.

Grand Logic delivers software solutions that automate your business processes and tame your IT operations & Big Data analytics. Grand Logic delivers data and job automation software and Hadoop consulting services that maximize your Big Data investment.

Friday, April 26, 2013

Data In and Out of the Hybrid Cloud

The continued adoption of, and accelerating interest in, cloud computing has created an interesting problem for the large enterprise. Most large enterprises have existing private networks and intranets. As parts of the organization begin to adopt public and private clouds, there is the challenge of moving data in and out of these clouds.

Data movement is challenging for a number of reasons, including security and network accessibility. For example, you usually can't walk up to these clouds and load a tape or hard drive, and bandwidth is often more restricted than between internal networks. Applications running in your private network may not be able to talk directly to the cloud without going through web services or other new networking schemes.

Then there is the question of cost. When does a business decide it is time to bring data back from the cloud? There is a point where keeping certain data in the cloud could become cost prohibitive. While the cost of cloud computing and storage keeps going down and newer services pop up all the time (like Amazon Glacier, for example), this issue is not going away.

Also, for Big Data computation and storage, at what point does keeping your data running in Amazon EMR or stored in S3 become prohibitively expensive? These questions are important for organizations to understand as the adoption of cloud computing and Big Data analytics accelerates. There is no simple answer, of course, but it is important for organizations to consider these questions from both an IT and a financial perspective.
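There is no one-size-fits-all answer, but the comparison itself is simple arithmetic. The sketch below uses made-up prices (none of these figures are actual Amazon or hardware pricing) purely to show the shape of a storage break-even comparison over time.

  // Hypothetical cost comparison: every price here is a placeholder, not real
  // AWS or hardware pricing. The point is the shape of the comparison.
  public class CloudStorageBreakEven {

    public static void main(String[] args) {
      double dataTb = 200.0;                 // data set size (hypothetical)
      double cloudPerTbMonth = 30.0;         // $/TB-month in cloud object storage (hypothetical)
      double onPremPerTbCapital = 400.0;     // $/TB up-front hardware cost (hypothetical)
      double onPremPerTbMonthOps = 8.0;      // $/TB-month power, space, admin (hypothetical)

      for (int months = 6; months <= 36; months += 6) {
        double cloud = dataTb * cloudPerTbMonth * months;
        double onPrem = dataTb * onPremPerTbCapital + dataTb * onPremPerTbMonthOps * months;
        System.out.printf("%2d months: cloud $%,.0f vs on-prem $%,.0f%n", months, cloud, onPrem);
      }
    }
  }

With these placeholder numbers the crossover lands somewhere around the two-year mark; with your own data volumes, growth rates and egress costs the answer will be different, which is exactly why the question deserves an explicit IT and finance review.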

Saturday, April 20, 2013

Databases are Cool Again

It is definitely interesting what is happening in the database space these days. It is good to see the NoSQL and NewSQL folks spark a fire under the traditional relational vendors. This is the only way to inspire innovation in both the commercial and open source space. To a large extent, the established relational players were slow to jump on the cloud, and even to this day they are not moving as aggressively as they need to in order to reclaim the cloud database market from the NoSQL and NewSQL upstarts.

Fundamentally, the scale-out capabilities of traditional RDBMS engines still don't hit the sweet spot developers and cloud operations people need these days. I do expect the market will consolidate somewhat in the next several years, as there are just too many players at the moment, especially on the NoSQL side. But I expect a large selection of NoSQL engines to remain over time, as many of the NoSQL players target specialized areas; it is definitely not one size fits all, as it has historically been with relational databases. Many of the NoSQL engines have made deliberate engineering trade-offs in areas such as storage models, consistency, replication, aggregation capabilities, and scale-out. For example, if you need a NoSQL engine with strong aggregation functions you might choose MongoDB, but if you need something that scales out writes and supports data center replication you might go with Cassandra. So, in the long term I do not see a single NoSQL engine that can rule them all.

Tuesday, April 16, 2013

Cloud Job Scheduler on EC2

Grand Logic is pleased to announce a new release of our JobServer Cloud edition product. JobServer Cloud edition allows our customers to access and use the powerful job scheduling, job processing, workflow and SOA messaging features available in JobServer from a cloud environment. JobServer now supports deployment on Amazon EC2, allowing our customers to lower their IT costs and free themselves to focus on their core applications. If you are using EC2 to host your applications and you are in need of a job scheduling and processing solution, JobServer is a perfect choice.

JobServer Cloud delivers all the same great features and capabilities found in our core JobServer software and can now be hosted in the cloud. This frees customers from the IT burden of buying and maintaining hardware and installing and managing their own IT environment for JobServer. With the cloud edition of JobServer, you are freed from dealing with upgrades, maintenance tasks and hardware issues. We can quickly add more job processing capacity as you need it. You just need to focus on building and deploying your jobs and Java Tasklets. We will ensure that your JobServer environment is well maintained, managed, backed up and running efficiently. We will alert you if we notice performance issues, slow running jobs or processes, or if more capacity is needed. Daily and weekly reports are delivered that provide detailed job scheduling and processing statistics. Contact our support team to get set up with a fully managed instance of JobServer.

With our fully managed JobServer Cloud solution, there is no hardware or software to install! Your environment will be deployed and run from the Amazon cloud with secure connectivity to your private network using Amazon VPC (Virtual Private Cloud). You can access your JobServer instance securely to run and manage your jobs and apps with full access to your private corporate network. Many customers also need their JobServer environment to connect to local IT systems and services within their private corporate network. With Amazon VPC we can securely bridge between a company's existing IT infrastructure and its JobServer Cloud environment. Amazon VPC enables enterprises to connect their existing infrastructure to a set of isolated JobServer compute resources via a Virtual Private Network (VPN) connection, and to extend their existing management capabilities such as security services, firewalls, and intrusion detection systems to include their JobServer resources in the Amazon cloud.

If you want to install and manage your own JobServer instances, then download JobServer today and install on your EC2 environment to start scheduling and processing jobs. Start with one EC2 instance or scale JobServer to run on hundreds of EC2 instances. JobServer scales easily and effectively on EC2 to allow you to run thousands of jobs. JobServer can scale to meet your needs by taking full advantage of your Amazon cloud environment.

About Grand Logic
Grand Logic is dedicated to delivering software solutions to its customers that help them automate their business and manage their processes. Grand Logic delivers automation software and specializes in mobile and web products and solutions that streamline business.

Saturday, March 9, 2013

Putting NoSQL in Perspective

Deciding between a NoSQL database or a relational database system is about understanding the trade-offs that led to the creation of NoSQL to begin with. NoSQL systems have advantages over traditional SQL databases because they give up certain RDBMS features in order to gain other performance, scalability and developer usability capabilities.

What NoSQL gives up (this varies by NoSQL engine):
  • Relationships between entities (like tables) are limited to non-existent. For example, you usually can't join tables or models together in a query. Traditional concepts like data normalization don't really apply. But you still must do proper modeling based on the capabilities of the particular NoSQL system. NoSQL data modeling varies by product and whether you are using a document vs column based NoSQL engine. For example, how you might model your data in MongoDB vs HBase varies because each solution offers significantly different capabilities.
  • Limited ACID transactions. The level of read consistency and atomic write/commit capabilities across one or more tables/entities varies by NoSQL engine.
  • No standard domain language like SQL for expressing ad-hoc queries. Each NoSQL has its own API and some of the NoSQL vendors have limited ad-hoc query capability.
  • Less structured and rigid data model. NoSQL typically gives more responsibility to the application layer, where the developer must "do the right thing" when it comes to data relationships and consistency. Think of NoSQL as schema-on-read instead of the traditional schema-on-write (see the sketch just after this list).
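As an illustration of what that application-layer responsibility looks like, here is a small plain-Java sketch; the document structure and field names are hypothetical. The reading code, not the database, decides how to interpret records that may be missing fields or carry fields of different shapes.

  import java.util.HashMap;
  import java.util.Map;

  // Schema-on-read in miniature: documents arrive with whatever fields the
  // writer happened to include, and the reading code applies the "schema".
  public class SchemaOnReadSketch {

    static int readLoginCount(Map<String, Object> userDoc) {
      // Older documents may lack the field entirely; newer ones may store it
      // as a number or as a string. The application absorbs both cases.
      Object raw = userDoc.get("loginCount");
      if (raw == null) {
        return 0;                              // default for legacy documents
      }
      if (raw instanceof Number) {
        return ((Number) raw).intValue();
      }
      return Integer.parseInt(raw.toString()); // tolerate the string-typed variant
    }

    public static void main(String[] args) {
      Map<String, Object> oldDoc = new HashMap<>();
      oldDoc.put("name", "alice");             // written before loginCount existed

      Map<String, Object> newDoc = new HashMap<>();
      newDoc.put("name", "bob");
      newDoc.put("loginCount", "17");          // written by a service that stores strings

      System.out.println(readLoginCount(oldDoc)); // prints 0
      System.out.println(readLoginCount(newDoc)); // prints 17
    }
  }

A relational schema would have rejected or defaulted these variations at write time; with NoSQL, every reader of the data has to carry that logic, which is the trade-off behind the flexibility.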

What NoSQL offers:
  • Easier to shard and distribute data across a cluster of servers. Partly because of the lack of data relationships, it is easier to distribute data and add capacity incrementally and horizontally. This can give much higher read/write scalability and better fail-over capabilities, for example.
  • Can more easily deploy on cheaper commodity hardware (and in the cloud) and expand scalability more incrementally and economically.
  • Don't need as much up-front DBA type of support. But if your NoSQL gets big you will spend a lot of time doing admin work regardless.
  • NoSQL has a looser data model, so you can have sparser data sets and variable data sets organized in documents or name/value column sets. Data models are not as hard wired.
  • Schema migrations can be easier, but this puts the burden on the application layer (the developer) to adjust to changes in the data model.
  • Depending on what type of application you are building, NoSQL can make getting started a little easier since you need less time planning for your data model. So for collecting high velocity and variable data, NoSQL can be great. But for modeling a complex ERP application it may not be such a great fit.

Like with most things, there is no silver bullet here with NoSQL. There are many different products available these days, and each has its own particular specialties and pros/cons. In general, you should think of NoSQL as a complementary data storage solution and not as a complete replacement for your relational/SQL systems, though this will depend on your applications and product functionality requirements. For example, using NoSQL in an analytics environment versus an OLTP setting can greatly affect how you use it and which specific NoSQL engine you choose. Keep in mind you may end up using more than one NoSQL product within your environment based on the specific capabilities of each - this is not uncommon.

Relational databases are also evolving; for example, new hybrid NoSQL-oriented storage engines are coming out based around MySQL. Also, products like NuoDB and VoltDB (what some are calling NewSQL) are trying to evolve relational databases beyond the vertical scaling and legacy storage restrictions of the past by using a fundamentally different architecture from the ground up. Keep your seat belts fastened; the database landscape has not been this innovative and fast moving in decades.

Tuesday, March 5, 2013

SQL and MPP: The Next Phase in Big Data

Over the past couple of years we have all heard about the Big Data movement. Two key enablers in this remaking of analytics, data warehousing and general computing have been the NoSQL database movement and the emerging Hadoop compute stack. While not directly related to each other, both NoSQL and Hadoop have become associated with the rapidly accelerating Big Data revolution as more companies look to manage larger and larger data sets more effectively and economically. NoSQL has been the new kid on the block in the database space, attempting to take applications and data to the promised land of web-scale computing where traditional relational databases have fallen short. Over the past decade, SQL and relational database technology have failed to effectively keep up with developer needs and the scaling demands of a new generation of social and data-heavy applications, and this has opened the door to a different approach from the NoSQL camp. Hadoop is in a similar position, promising to deliver analytics and offline batch computing power that, before the emergence of HDFS and Map Reduce, was for the most part only found in expensive and proprietary analytics products.

As with many proclaimed revolutions, Big Data is just the tip of the iceberg, as they say. Like with most transformations in technology, there is more to come as these technologies penetrate more industries and gain wider adoption and broader acceptance by the open source community and the established heavy hitters. The next wave of Big Data technology will push into other domains and go beyond the offline computing boundaries of HDFS and Map Reduce. While SQL and relational database centered analytics have taken a back seat lately because of the emergence of NoSQL, SQL as a domain language will get an uplift with the next Big Data wave as we move past the basic offline Map Reduce paradigm and look towards real-time computing engines that enable MPP (massively parallel processing). This will allow IT organizations to continue to benefit from the low cost of commodity hardware and the horizontal scaling brought about by Map Reduce and HDFS, now generalized further for real-time analytics.

While NoSQL has established itself as a technology that is here to stay, the traditional relational database paradigm is not gone by any stretch and still provides an invaluable ad hoc query function to analysts and developers alike. NoSQL products like Cassandra, HBase and MongoDB (to mention a few) solve a unique problem and are becoming key foundations of any web-scale computing stack, whether for online CRUD apps or for offline analytics. But that does not eliminate the need for, or diminish the power of, relational SQL engines and SQL as a powerful, expressive domain language. NoSQL is not a silver bullet, but it can be a powerful complement to traditional relational data storage models. The NoSQL folks have made the classic engineering trade-off, exchanging certain features found in relational databases for greater horizontal scalability. I will not get into the details, and I do not want to oversimplify what NoSQL has done, but at the heart of the trade-off is eliminating relationships between data entities in order to allow greater horizontal scalability. NoSQL also gives the developer a more flexible schema-on-read model that has its benefits.

So what does this all mean? Well, expect NoSQL and the current 1.0 Hadoop stack to continue to mature and become more mainstream - that is a no-brainer. But for the next phase I see SQL (for ad hoc querying) and real-time MPP becoming part of this Big Data fabric, bringing back the ad hoc capabilities of the relational database but now with the horizontal scaling and cost effectiveness found in HDFS and Map Reduce.

You can see this next phase is already happening by observing all the commercial products rushing to extend their traditional analytics engines to work on top of Hadoop and all the investment going into taking Hadoop beyond its current offline Map Reduce roots. They vary from open source next generation MPP platforms, to cloud providers offering analytics as a service, to traditional data warehouse vendors extending their products to run on top of Hadoop, to next generation relational database start-ups. Here is a sample of some of the players and products to watch:

Hadoop 2.0 Players

Cloudera - Impala
Cloudera is leading the charge to create a next generation open source MPP platform that builds on the core components of Hadoop (HDFS, Zookeeper, MR...etc) to enable real-time analytics of Big Data. The initiative is open source but primarily driven (at least for now) by Cloudera. This is also partly a recognition that Map Reduce and tools like Hive are fine for certain offline analytics and processing but are not a complete solution for real-time reporting and analytics.

MapR - Apache Drill
This is a similar project to Impala, but channeled through the Apache organization and primarily driven by MapR (a Cloudera competitor in the Hadoop space).

Hadapt
A vertical solution for organizations wanting a more SQL-friendly interface to their Hadoop data sources.

Datameer
Another Hadoop vertical player that is trying to make analytics and reporting easier for the Hadoop stack.

Cloud Players

Google - Big Query
This is Google's cloud service combining a distributed data store with a powerful SQL-like ad hoc query engine (based on Dremel).

Amazon - RedShift
An Amazon service to help businesses build data warehouses in the cloud more economically, with an ad hoc SQL query interface. Partially based on technology from ParAccel.

Old School Players

IBM - Netezza
While traditionally focused on enterprise data warehousing, IBM is evolving their stack to fit and play nice with Hadoop and other Big Data solutions.

HP - Vertica
HP's Big Data play. Like IBM and Teradata, HP acquired its way into the Big Data space.

Teradata - Aster Data
Teradata is a true old school player from the days when Big Data centered only around relational databases. Their acquisition of Aster Data changed that.

Next Generation SQL Players to Watch

NuoDB
NuoDB is the new kid on the block, promising a new way to scale and build relational databases in the cloud. Its approach is more or less based on a peer-to-peer model that allows it to scale out (as the company claims) while still delivering the traditional capabilities of a relational database, such as read consistency and ACID transactions. While NuoDB is more focused on OLTP-type processing, its claim that it can scale horizontally while supporting a SQL relational model makes it potentially powerful for real-time analytics as well.

VoltDB
Another new age relational database engine that delivers horizontal scaling while retaining SQL capabilities. It differs from NuoDB by taking a caching approach to meet the scaling challenge.

For the next wave of Big Data innovation, the landscape is rapidly changing, with both old and new industry players getting into the game. Big Data will no longer be limited to offline, long-latency analytics processing. The lines between OLTP, OLAP and enterprise data warehousing are blurring as offline computing, real-time analytics and data storage models evolve and converge. Expect better technology options and improved cloud scalability at a lower price of ownership as the competition heats up and the next evolution of Big Data matures. Pick a horse and run with it. Stay tuned.

Wednesday, February 27, 2013

Grand Logic a Featured Cloudera Partner

We are proud to be selected by Cloudera as part of their featured partner list and to join the largest and fastest growing Apache Hadoop ecosystem. Grand Logic has been a strong proponent of Apache Hadoop and the potential of Big Data computing.

Grand Logic has integrated support for Hadoop and other Big Data technologies into our flagship product, JobServer. This has brought enterprise job processing and automation to Big Data computing and converged traditional business automation and job processing with Big Data analytics and computing. There are no islands or barriers here. JobServer allows enterprises of all sizes, from startups to the Fortune 500, to converge their back office business processing, SOA assets, and Big Data computing infrastructure such as MapReduce, BigQuery, Hive queries, Impala queries, Pig…etc, all under one job scheduling and management platform.

Download and test drive JobServer today and learn more about JobServer's powerful developer SDK, soafaces, which makes extending and customizing JobServer and developing custom jobs (Hadoop jobs, SOA jobs, ETL jobs, BigQuery jobs, Hive jobs…etc) and backend automated services easier.

Grand Logic delivers software solutions that automate your business processes and tame your IT operations & Big Data analytics. Grand Logic delivers data and job automation software and Hadoop consulting services that maximize your Big Data investment.