Tuesday, May 14, 2013

Hadoop the New 'T' in ETL

ETL is a common computing paradigm used in a variety of data movement and data management scenarios. As demand for more insight into business data as grown, ETL has been used to move more data from operational data stores into OLAP and data warehousing environments. This has expanded the need for analytics and other solutions that rely on data being reconstituted into easier to consume forms or data models more efficient to solve specific problems.



So nothing special going on here, but as data volumes have grown and sources of data have exploded, the transformation part of ETL (the "T") is becoming more of a challenge, especially as organizations demand more near real-time analytics and up to date information. Transforming the volumes of operational data is becoming a computing bottleneck and often limits what you can do with data after it has been transformed and loaded into downstream data marts. See a typical ETL data flow diagram below.



Big Data to the Rescue
With the evolution of big data and Hadoop, new tools have been brought to bear that can provide help in the overall ETL computing process. However, with Hadoop, the ETL model needs to be revisited. Hadoop can bring tremendous computing resources to more efficiently transform data into target models. While Hadoop can serve as part of your overall processing fabric and can be leverage directly for OLAP and itself be used for data warehousing (e.g. HBase data store), it can also serve as a intermediate staging area that can be used to populate traditional relational data marts.

Using Hadoop in this way allows it to be used as an intermediate store for data until it can later be transformed into target models. We can accomplish this "load first" approach using Hadoop, by changing the ETL model around a bit. Instead of extracting and transforming data first, we can instead extract and load data into Hadoop storage, for staging, and then take full advantage of the Hadoop compute infrastructure to transform (using Map Reduce, Impala, Drill…etc) the data into target models that can feed traditional relational data marts and OLAP engines. See diagram for example:



Hadoop for Transformation
This essentially allows organizations to use Hadoop as the transformation platform that allows developers to perform more complex transformations that were not practical in the normal ETL universe. So think of Hadoop as the new super charged "T" in the "ELT" paradigm, where data is moved as efficiently as possible from operational stores and loaded ("L") into HDFS (and HBASE or Cassandra) as fast as possible. Then the "T" can be performed within the Hadoop ecosystem. This allows Hadoop to be a powerful intermediary layer that can drive new analytics and allow existing analytics to keep up with the deluge of data. This also allows existing OLAP and data warehouses to continue to consume data out of Hadoop for existing analytics.

So let us start getting used to the concept of "ELT" as the new big data cousin of ETL. Hadoop is more than just a historical archive or dumping ground for unstructured data. It can be a powerful transform computing layer that can drive better data warehousing for new and existing analytics solutions.

Grand Logic delivers software solutions that automate your business processes and tame your IT operations & Big Data analytics. Grand Logic delivers data and job automation software, Hadoop consulting services and that maximize your Big Data investment.

Friday, April 26, 2013

Data In and Out of the Hybrid Cloud

The continued adoption, and accleration of interest, of cloud computing has creating an interesting problem for the large enterprise. Most large enterprises have existing private network and intranets. As parts of the organization begin to adopt public and private clouds, there is a challenge of moving data in and out of these clouds.

Data movement is more challenging for a number of reasons including security and network accessability. For example, you can usually can't walk up to these clouds and load a tape or hard drive. And badwidth is often times more restricted than that between internal networks. Appliations you have running in your private network may not be able to talk directly to the cloude without going through web services or other new networking schemes.

Then there is the question of cost. When does a business decide it is time to bring back data from the cloud? There is a point where keeping certain data in the cloud could become cost prohibative. While the cost of cloude computing and storage is always going down and newer services popup up all the time (like Amazon Glacier for example), this issue is not going away.

Also for Big Data type computation and data storage, at what point is keeping your data running in an Amazone EMR or stored in S3 beceome prohibatively expensive? All these questions are important for organizations to understand as the adoption of cloud computing and Big Data analytics accelerate. There is no simple answer of ourse, but it is important for organizations to consider these questions with both from an IT and financal perspective.

Saturday, April 20, 2013

Databases are Cool Again

It is definitely interesting what is happening in the database space these days. It is good to see the NoSQL and NewSQL folks spark a fire under the traditional relational vendors. This is the only way to inspire innovation both in the commercial and open source space. To a large extended, the establishted relational player were slow to jump on the cloud and even to this day they are not moving as aggressively as they need to in order to reclaim the cloud database market from the NoSQL and NewSQL upstarts.

Fundamentally, the scale-out capabilities of traditional RDBMS engine still don't hit the sweet spot developers and cloud operations people need these days. I do expect the market will consolidate somewhat in the next several years as there are just too many players at the moment, especially on the NoSQL side. But I expect there to remain a large selection of NoSQL engines over time, as many of the NoSQL players target specialized areas so it is definitely not a one size fits all like it has historically been with relational database. For example, many of the NoSQL engines have made deliberate enginering trade-offs in their products such as in their storage models, consistency, replication, aggregation capabilities, and scale-out…etc. For example, if you need a NoSQL with strong aggregation functions you might choose MongoDB but if you need something that scales out writes and data center replication you might go with Cassandra. So, in the long-term I do not see a single NoSQL that can rule them all.

Tuesday, April 16, 2013

Cloud Job Scheduler on EC2

Grand Logic is pleased to announced a new release of our JobServer Cloud edition product. JobServer Cloud edition allows our customers to access and use the powerful job scheduling, job processing, workflow and SOA messaging features available in JobServer from a cloud environment. JobServer now supports deployments on Amazon EC2, allowing our customers to lower their IT costs and free themselves to focus on their core applications. If you are using EC2 to host your applications, and you are in need of a job scheduling and processing solution, then JobServer is a perfect choice.

JobServer Cloud delivers all the same great features and capabilities found in our core JobServer software and can now be hosted in the clouds. This frees customers from the IT burden of buying and maintaining hardware and installing and managing their own IT environment for JobServer. With the cloud edition of JobServer, you are freed from dealing with upgrades and managing maintenance tasks and dealing with hardware issues. We can add more job processing capacity, quickly, as you need it. You just need to focus on building and deploying your jobs and java Tasklets. We will ensure that your JobServer environment is well maintained, managed and backed up and running efficiently. We will alert you if we notice performance issues or slow running jobs and processes or if more capacity is needed. Daily and weekly reports are delivered that provide detailed job scheduling and processing statistics. Contact our support team to get setup with a fully managed instance of JobServer.

With our fully managed JobServer Cloud solution, there is no hardware or software to install! With JobServer Cloud, your environment will be deployed and run from the Amazon cloud with secure connectivity to your private network using Amazon VPC (Virtual Private Cloud). You can access your JobServer instance securely to run and manage your jobs and apps with full access to your private corporate network. Many customers also need their JobServer environment to connect and access local IT systems and services within their private corporate network. With Amazon's VPC (Virtual Private Cloud) we can securely bridge between a company’s existing IT infrastructure and your JobServer Cloud environment. Amazon VPC enables enterprises to connect their existing infrastructure to a set of isolated JobServer compute resources via a Virtual Private Network (VPN) connection, and to extend their existing management capabilities such as security services, firewalls, and intrusion detection systems to include their JobServer resources that are in the Amazon cloud.

If you want to install and manage your own JobServer instances, then download JobServer today and install on your EC2 environment to start scheduling and processing jobs. Start with one EC2 instance or scale JobServer to run on hundreds of EC2 instances. JobServer scales easily and effectively on EC2 to allow you to run thousands of jobs. JobServer can scale to meet your needs by taking full advantage of your Amazon cloud environment.

About Grand Logic
Grand Logic is dedicated to delivering software solutions to its customers that help them automate their business and manage their processes. Grand Logic delivers automation software and specializes in mobile and web products and solutions that streamline business.

Saturday, March 9, 2013

Putting NoSQL in Perspective

Deciding between a NoSQL database or a relational database system is about understanding the trade-offs that led to the creation of NoSQL to begin with. NoSQL systems have advantages over traditional SQL databases because they give up certain RDBMS features in order to gain other performance, scalability and developer usability capabilities.

What NoSQL gives up (this varies by NoSQL engine):
  • Relationships between entities (like tables) are limited to non-existent. For example, you usually can't join tables or models together in a query. Traditional concepts like data normalization don't really apply. But you still must do proper modeling based on the capabilities of the particular NoSQL system. NoSQL data modeling varies by product and whether you are using a document vs column based NoSQL engine. For example, how you might model your data in MongoDB vs HBase varies because each solution offers significantly different capabilities.
  • Limited ACID transactions. The level of read consistency and atomic write/commit capabilities across one or more tables/entities varies by NoSQL engine.
  • No standard domain language like SQL for expressing ad-hoc queries. Each NoSQL has its own API and some of the NoSQL vendors have limited ad-hoc query capability.
  • Less structured and rigid data model. NoSQL typically forces/gives more responsibility at the application layer for the developer to "do the right thing" when it comes to data relationships and consistency. Think of NoSQL as a schema on read instead of the traditional schema on write.

What NoSQL offers:
  • Easier to shard and distribute the data across a cluster of servers. Partly because of lack of data relationships it is easier to shard and distribute data across a cluster and capacity more incrementally and horizontally. This can give much higher read/write scalability and fail-over capabilities, for example.
  • Can more easily deploy on cheaper commodity hardware (and in the cloud) and expand scalability more incrementally and economically.
  • Don't need as much up-front DBA type of support. But if your NoSQL gets big you will spend a lot of time doing admin work regardless.
  • NoSQL has a looser data model, so you can have sparser data sets and variable data sets organized in documents or name/value column sets. Data models are not as hard wired.
  • Schema migrations can be easier but puts burden on application layer (developer) to adjust to changes in the data model.
  • Depending on what type of application you are building, NoSQL can make getting started a little easier since you need less time planning for your data model. So for collecting high velocity and variable data, NoSQL can be great. But for modeling a complex ERP application it may not be such a great fit.

Like with most things, there is no sliver bullet here with NoSQL. There are many different products available these days and each has its own particular specialties and pros/cons. In general, you should think of NoSQL as a complementary data storage solution and not as a complete replacement of your relational/SQL systems, but this will depend on your applications and product functionality requirements. For example, how you use NoSQL in a analtyics environment vs a OLTP setting can greatly effect how you use NoSQL and which specific NoSQL engine you choose. Keep in mind you may end up using more than oen NoSQL product within your environment based the specific capabilities of each - this is not uncommon.

Relational databases are also evolving, for example, new hybrid NoSQL oriented storage engines are coming out based around MySQL. Also products like NuoDB and VoltDB (what some are calling NewSQL) are trying to evolve relational databases beyond the vertical scaling and legacy storage computing restrictions of the past, by using a fundamentally different architecture from the ground up. Keep you seat belts fastened, the database landscape has not been this innovative and fast moving in decades.

Tuesday, March 5, 2013

SQL and MPP: The Next Phase in Big Data

Over the past couple years we have all by now heard about the Big Data movement. Two key enablers in this remaking of analytics, data warehousing and general computing have been the NoSQL database technology movement and the emerging Hadoop compute stack. While not directly related to each other, both NoSQL and Hadoop have become associated with the rapidly accelerating Big Data revolution as more companies look to manage larger and larger data sets more effectively and economically. NoSQL has been the new kid on the block in the database space by attempting to take applications and data to the promised land of web-scale computing where traditional relational databases have fallen short. Over the past decade, SQL and relational database technology have failed to effectively keep up with developer needs and the scaling demands of a new generation of social and data heavy applications and this has opened the door to a different approach from the NoSQL camp. Hadoop is also in the same position by promising to deliver analytics and offline batch computing power not practical or cost effective before the emergence of HDFS and Map Reduce, which are for the most part, currently only found in expensive and proprietary analtyics products.

As with many proclaimed revolution such as Big Data, this is just the tip of the iceberg as they say. Like with most transformations in technology there is more to come as these technologies penetrate into more industries and gain wider adoption and broader acceptance by the open source community and the established heavy hitters. The next wave of Big Data technology will push the edges into other domains and go beyond the offline computing boundaries of HDFS and Map Reduce. While SQL and relational database centered analytics has taken a back seat lately because of the emergence NoSQL, SQL as a domain language will get an uplift with the emergence of the next Big Data wave as we move past the basic offline Map Reduce paradigm and look towards more real-time computing engines that can enable MPP (massively parallel processing) computing. This will allow IT organizations to continue to benefit from the low cost of commodity hardware and horizontal scaling benefits brought about by Map Reduce and HDFS and now generalized further for real-time analytics.

While NoSQL has established itself as a technology that is here to stay, the traditional relational database paradigm is not gone by any stretch and still provides an invaluable ad hoc query function to analysts and developers alike. NoSQL products like Cassandra, HBase and MongoDB (to mention a few) solve a unique problem and are becoming key foundations of any web-scale computing stack whether for online CRUD apps or for offline analtyics. But that does not eliminate the need or diminish the power of relational SQL engines and SQL as a powerful expressive domain language. NoSQL is not a silver bullet but can be a powerful complementary solution to traditional relational data storage models. The NoSQL folks have used the classic engineering trade-off where they have exchanged certain features found in relational databases to gain greater horizontal scalability. I will not get into the details of this but I do not want to over simplify what NoSQL has done. At the heart of the trade-off is eliminating relationships between data entities for the benefit of allowing for greater horizontal scalability. NoSQL also give the developer a more flexible "on-read" schema model that has its benefits.

So what does this all mean? Well, expect NoSQL and the the current 1.0 Hadoop stack to continue to mature and become more mainstream - that is no-brainer. But for the next phase I see SQL (for ad hoc querying) and real-time MPP becoming part of this Big Data fabric and this will bring back the ad hoc capabilities of relational database but now with the horizontal scaling and cost effectiveness found in HDFS and Map Reduce.

You can see this next phase is already happening by just observing all the commercial products rushing to extend their traditional analtyics engines to work on top of Hadoop and all the investment going into taking Hadoop beyond its current it offline Map Reduce roots. They very from open source next generation MPP platforms, to cloud providers offering analytics as a service, to traditional data warehouse vendors extending their products to run on time of Hadoop to next generation relational database start-ups. Here is a sample of some of the players and products to watch:

Hadoop 2.0 Players

Cloudera - Impala
Cloudera is leading the charge to create a next generation open source MPP platform that builds on the core components of Hadoop (HDFS, Zookeeper, MR...etc) to enable real-time analytics of Big Data. The initiative is open source but primarily driven (at least for now) by Cloudera. This is also partly a recognition that Map Reduce and tools like Hive are fine for certain offline analytics and processing but are not a complete solution for real-time reporting and analytics.

MapR - Apache Drill
This is a similar project to Impala but channeled through the Apache organization and primary driven by MapR (Cloudera Hadoop competitor).

Hadapt
Vertical solution for Hadoop for organizations wanting a more SQL friendly interface to their Hadoop data sources.

Datameer
Another Hadoop vertical player that is trying to make analytics and reporting easier for the Hadoop stack.

Cloud Players

Google - Big Query
This is Google's cloud services that is a combination of a distributed data store coupled with a powerful SQL like ad hoc query engine (based on the Dremel language).

Amazon - RedShift
Amazon service to help businesses more economically build data warehouses in the clouds with ad hoc SQL query interface. Partially based on technology from ParAccel.

Old School Players

IBM - Netezza
While traditionally focused on enterprise data warehousing, IBM is evolving their stack to fit and play nice with Hadoop and other Big Data solutions.

HP - Vertica
HP's Big Data play. Like IBM and Terradata, HP acquired their way into the Big Data space.

Teradata - Aster Data
Teradata is a true old school player in the Big Data space when the world only centered around relational databases. Their acquisition of Aster Data changed that.

Next Generation SQL Players to Watch

NuoDB
NuoDB is the new kid on the block promising a new way to scale and build relational databases in the cloud. Their approach is more or less based on a peer to peer model that allows them to scale out (as they claim) while still delivering on the traditional capabilities of relational database such as read consistency and ACID transactions. While NuoDB is more focused on OLTP type processing its claim that it can scale horizontally while supporting a SQL relational model makes it potentially powerful for real-time analytics as well.

VoltDB
Another new age relational database engine that delivers horizontal scaling yet retaining SQL capabilities. Differs from NuoDB by taking a caching approach to meet scaling challenge.

For the next wave of Big Data innovation, the landscape is rapidly changing with both old and new industry players getting into the game. Big Data will no longer be limited to offline and long latency based analytics processing. The lines between OLTP, OLAP and Enterprise Data Warehousing are blurring as offline computing, real-time analytics and data storage models evolve and converge. Expect better technology options and improved cloud scalability at lower price of ownership as the competition heats up and the next evolution of Big Data matures. Pick a horse and run with it. Stay tuned.

Wednesday, February 27, 2013

Grand Logic a Featured Cloudera Partner

We are proud to be selected by Cloudera as part of their featured partner list and to join the largest and fastest growing Apache Hadoop ecosystem. Grand Logic has been a strong proponent of Apache Hadoop and the potential of Big Data computing.

Grand Logic has integrated support for Hadoop and other Big Data technologies into our flagship product, JobServer. This has brought enterprise job processing and automation to Big Data computing and converged traditional business automation and job processing with Big Data analytics and computing. There are no islands or barriers here. JobServer allows enterprises of all sizes from startups to Fortune 500 to converge their back office businesses processing, SOA assets, and Big Data computing infrastructure such as MapReduce, Big Query, Hive queries, Impala queries, Pig…etc, all under one job scheduling management platform.

Download and test drive JobServer today and learn more about JobServer's powerful developer SDK, soafaces, that makes extending and customizing JobServer and developing custom jobs (Hadoop jobs, SOA jobs, ETL jobs, BigQuery jobs, Hive Jobs…etc) and backed automated services easier.

Grand Logic delivers software solutions that automate your business processes and tame your IT operations & Big Data analytics. Grand Logic delivers data and job automation software, Hadoop consulting services and that maximize your Big Data investment.