Cloud Analytics & ML with Sam Taha

Wednesday, December 2, 2015

No Compromise Database with NoSQL & Apache Spark

Database technology has been going through a renaissance over the past several years. Relational databases have matured steadily over the past couple of decades; however, they were not well equipped to deal with the data volume, velocity and variety (the three Vs) now demanded by the world of social apps, mobile, IoT, and Big Data - just to name a few.

We are now seeing many new database engines coming to the market (commercial and open source) geared to servicing particular application domains and functional verticals. There is some awesome innovation happening, but the common theme you see with the vast majority of these databases is that they give up something from the traditional relational database world to achieve, for example, the CAP theorem sweet spot they are aiming for or the volume, scalability and throughput they are trying to reach.

The most common tradeoff made by many of the NoSQL database engines, for example, is the elimination of table or entity joining. Joining data sets is a fundamental part of the relational model that allows for modeling data using a normalization approach and having a schema that can serve multiple application scenarios. The approach is different with NoSQL databases. When designing a NoSQL database schema, the modeling of the schema and data (or lack of schema, or less rigid schema) is very tightly coupled with how the applications will use the data. So NoSQL databases trade off the strong typing of the relational world and push more complexity to the application tier.

The fact that joining is missing from many of the popular NoSQL engines (Cassandra, MongoDB...) puts more complexity on the application tier, which must now offer functionality such as combining and mashing different data sources together. For example, trying to do a join between two data sets pulled from two different tables or storage engines can be complex and hard to scale in the application tier. Enter Apache Spark. With Spark, application developers can use Spark's grid computing capabilities to perform database engine type operations without reinventing the wheel in the application layer, while at the same time leveraging a highly scalable compute and memory management grid with rich built-in data transformation operations (RDDs, map/reduce, filters, joins...).
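To make the application-tier complexity concrete, here is a minimal sketch, in plain Python, of the kind of hand-written join described above. The `users` and `orders` records and the `hash_join` helper are hypothetical, made up purely for illustration; they stand in for rows pulled from two separate tables or storage engines.

```python
from collections import defaultdict


def hash_join(left, right, key):
    """Inner-join two lists of dicts on `key` (hypothetical helper, single process)."""
    # Build phase: index the left side by the join key.
    index = defaultdict(list)
    for row in left:
        index[row[key]].append(row)

    # Probe phase: look up each right-side row and merge every match.
    joined = []
    for row in right:
        for match in index.get(row[key], []):
            joined.append({**match, **row})
    return joined


# Hypothetical rows, e.g. users from one store and orders from another.
users = [{"user_id": 1, "name": "Ann"}, {"user_id": 2, "name": "Bob"}]
orders = [
    {"user_id": 1, "item": "book"},
    {"user_id": 1, "item": "pen"},
    {"user_id": 3, "item": "lamp"},
]

result = hash_join(users, orders, "user_id")
for row in result:
    print(row)
```

This works as long as both sides fit in the memory of one process. Once they do not, you end up writing your own partitioning and shuffling logic, which is exactly the wheel that a built-in distributed join in Spark saves you from reinventing.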

Combining Apache Spark with your backend application services is a powerful way to scale NoSQL databases by allowing for rich data operations across multiple tables, documents and polyglot data sources. And this can be done while leveraging Spark's very rich and expressive APIs and highly scalable processing and memory caching.

So Spark is not just for petabyte scale Big Data number crunching and machine learning tasks. You can use Spark in your traditional data management tier to join disparate data entities and use it for the rich data processing operations typically provided by relational databases. With Spark you get the benefits of NoSQL without compromise.

Embed Spark into your backend application tier and give Apache Spark a spin. It will change how you build backend services forever.

Wednesday, November 18, 2015

Understanding Apache Spark - Why it Matters

Apache Spark has come on the scene in the past few years and has taken the computing world by storm. It has been dubbed the replacement for Hadoop and is often seen as the next evolution in Big Data. Spark is one of the most active Apache projects and has developed a strong ecosystem. Even the Big Data players themselves are adopting it in their stack and positioning it as a key player in their overall open source and productized solutions.

Why has Spark been so successful? How is it better than or different from the first incarnation of Big Data (aka Hadoop)? Well, Spark does not abandon the principles that were realized by Hadoop and the companies that helped bring the Big Data philosophy to the masses. Spark builds on the basic building blocks of such technologies, such as HDFS, and on programming constructs such as Map-Reduce, and it does so in a way that makes building applications on top of Spark much more efficient and effective than on its predecessors.

Spark, like Hadoop, supports building a computing fabric that can be deployed and run on commodity hardware and that inherently supports horizontal scaling. Spark lowers the barriers for application developers to parallelize their applications and spread the computing and data access across a cluster of computers for processing. Hadoop does many of the same things, but Spark does them better, both from a technology implementation perspective (more efficient use of memory, garbage collection handling...) and by offering a much better application programming API.

What Spark does is raise the bar from a programming interface perspective. It has strong support for Java, Scala, Python and R. Its core operations for managing data (such as RDDs) and computing are very well designed interfaces and APIs. When working with Spark you still have to look at your application and the problem you are trying to solve and think about how to parallelize it, but the Spark APIs are intuitive to understand and to use for the typical application programmer. Spark gives you the tools to access essentially the same power that a grid computing platform or a distributed database engine might have internally, and makes it available to the average programmer to embed that same sophistication in their own application.
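As a rough illustration of what "thinking about how to parallelize" a problem means, here is a small plain-Python sketch of the map/reduce pattern over partitioned data. It does not use Spark at all: the `partition`, `count_words` and `merge` helpers and the sample lines are hypothetical, and the partitions are processed one after another here, whereas a cluster framework would hand them to different workers.

```python
from collections import Counter
from functools import reduce


def partition(items, n):
    """Split a list into n chunks (a stand-in for data spread across nodes)."""
    return [items[i::n] for i in range(n)]


def count_words(lines):
    """Map step: each partition counts its own words independently."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def merge(a, b):
    """Reduce step: combine two partial results into one."""
    return a + b


# Hypothetical input data.
lines = [
    "spark builds on map reduce",
    "spark keeps data in memory",
    "hadoop pioneered map reduce",
]

# Each partition is independent, so this step is the part that could run in parallel.
partials = [count_words(chunk) for chunk in partition(lines, 2)]
totals = reduce(merge, partials, Counter())
print(totals["spark"], totals["map"], totals["hadoop"])
```

The point of the sketch is the shape of the solution: the work is split so that no partition needs to see another partition's data, and only the small partial results are combined at the end.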

Spark is a game changer. It can be used for everything from ETL to basic application OLTP computations that drive a GUI, and from backend batch processing to real-time streaming applications and graph modeling. It will bring some of the powerful distributed computing technology pioneered by the internet giants into applications at all levels of the enterprise. Strap on your boots and start learning Spark. It is the next evolution not just in Big Data but in general purpose application programming that can leverage true distributed grid computing and bring it to the programming masses.

Monday, July 27, 2015

Unbundling Database Architecture: Turning Databases Inside-Out

Relational database technology has been around for a few decades now. In the last several years we have seen a resurgence of innovation around data storage and data processing. This has pushed us into the realm of thinking outside of traditional SQL and big iron monolithic computing.

NoSQL, NewSQL and distributed commodity/cloud storage are changing how we build persistence into our applications. However, the fundamentals of databases have not changed much. Lower cost memory and the availability of cheaper cloud computing have created a lot of innovation, but how databases function under the hood has not changed very much.

Transaction atomicity, replication and considerations such as the CAP theorem are still tackled in much the same way as they were with the earlier database engines. But is there a different way to look at how applications manage persistence for OLTP type transactions? Well, Apache Samza presents an interesting approach to how data is managed. While it comes at things from a streaming centric approach, it could present a new way for applications to manage general data storage in the future.

Here is an interesting blog that presents a breakdown of the Apache Samza architecture and how it can facilitate more general purpose application data management by using an "unbundled" architecture at the heart of the database engine. Is this just another specialized data storage engine geared toward streaming data and analytics, or a whole new way to think about database architecture?

Sunday, June 7, 2015

Isomorphic Web Apps: Back to the Future, Again

As web application development evolves, we continue to see the pendulum swing between client and server. Over the past two decades we have moved from simple multi-page HTML applications that are rendered exclusively on the server to ultra fat single page applications (SPAs) containing more JavaScript than anyone would have imagined a few years ago.

Over the past couple of years, many large hosted sites (e.g. Airbnb, Facebook and others) have run into challenges with building heavy JavaScript client apps and have rediscovered the value of rendering some of the web content on the server. Technology such as Node.js has made this easier and so has the creation of frameworks such as ReactJS. This rediscovery of using the server for rendering UI now has a new cool name, Isomorphic JavaScript. The name seems to have stuck, so we will need to add it to our lexicon :)

The technology around this new approach is gaining some steam of late. Here is a good blog from Airbnb on what led them to consider this architecture for their hosted web application services. While the idea of moving away from SPAs has been around for a while, we will for sure start to see more of the established front-end JavaScript frameworks incorporating it in one way or another, as well as new frameworks such as ReactJS.

ReactJS is one of the more popular frameworks that leverages server side rendering and advocates for this hybrid approach to web application development. While Node.js is the leading container for supporting this application delivery model, we will start to see JVM support and integration as well with Java 8 Nashorn.

There are many benefits to building your web application with an isomorphic JavaScript architecture that I will try to cover in an upcoming blog. There are already some good blogs covering the subject. Also expect AngularJS 2.0 to offer support for server side rendering, but we will have to wait and see what Google comes up with as AngularJS 2.0 gets further along.

So keep an eye out for this new twist in web application development. It will be a boost for mobile development as well, since mobile can certainly benefit from some server-side offloading of processing. But like most things, this new technology approach is no free lunch. Isomorphic JavaScript does add some complexity to constructing your web applications. Some of this may be alleviated as web application frameworks evolve and as HTML web component standards mature. Stay tuned.

Saturday, May 9, 2015

A Future Written in TypeScript?

Web developers! Get your TypeScript engines started. Sad to say that Dart is dead, but TypeScript is a much more natural evolution toward ECMAScript 6 and a more team scalable, structured and manageable extension to JavaScript programming (long live static typing :) to help bring web development out of the wild wild west.

Here is how AngularJS 2.0 is influencing the future of web development:
https://blog.mariusschulz.com/2015/03/06/angular-2-and-typescript

Saturday, February 7, 2015

Machine Learning - Algorithm & Category Breakdown

Thursday, November 13, 2014

Web Components are Real

Web Components are not another internet buzzword. Web Components are a collection of web browser constructs and standards that will modernize client side web development and improve the web design process overall. This has been a long time in the making, but these are the missing building blocks (along with continued ECMAScript maturity) that are needed to bring web development on par with traditional structured programming languages and environments without the need for the crazy hacks we have today.

The key standards behind Web Components include:

Shadow DOM: Finally DOM trees that don't step on each other. Modular DOM structures can exist and interact with each other.
Custom HTML Elements: HTML building blocks where each custom element can have encapsulated properties, functions and events. Elements can exist in a hierarchy/nesting and look and act like native HTML elements.
HTML Imports: Import HTML pages and source files like other programming languages.
CSS Grid Layout: Table and grid layout done in a more intuitive way and more akin to how most client GUI frameworks handle widget layout.

These standards will impact low level frameworks such as jQuery, but will also change the way higher order client side frameworks like AngularJS, GWT, Ember, Knockout evolve over time and how they provide wiring, plugin and extension capability to their developers.

So get ready for Web Components. They are real and will finally bring modular and structured programming to the web to support more robust, scalable, maintainable and extensible client side development frameworks.

P.S. Keep an eye on the Polymer project if you want to experiment with Web Components today. This client side framework packages many of the emerging standards into a developer friendly API and programming model. But keep in mind that Polymer is not Web Components; it is just a project that demonstrates the power of these new Web Component standards.

Sunday, September 14, 2014

JobServer Release 3.6.14

We are happy to announce the release of JobServer 3.6.14 which introduces LDAP support and improved shell script processing to allow turning any standalone program or shell script into an easy to automate and track application. Yes, with JobServer you give your shell scripts and batch standalone programs a GUI front-end that you can use to customize your shell scripts and leverage powerful reporting and monitoring to easily track all input and output related to your batch scripts and standalone programs.

With this release, JobServer now supports improved tracking of shell script output via the JobServer JobTracker reporting and tracking application. You can now preview the standard output of every shell script right from the top level JobTracker search report. You can also now run shell script jobs manually and pass custom input parameters to the shell scripts. Using JobServer with batch scripts just got a whole lot more fun and productive.

Want to simplify user authentication for you and your JobServer end users? Now with LDAP support, you can integrate JobServer with your Active Directory and LDAP compatible environment for more seamless user authentication.

Download and test drive JobServer 3.6.14 today and learn more about JobServer's powerful developer SDK, soafaces, that makes extending and customizing JobServer and developing custom jobs and backend automated services easier.

Grand Logic delivers software solutions that automate your business processes and tame your IT operations & Big Data analytics. Grand Logic delivers data and job automation software, along with Hadoop and predictive analytics consulting services, that maximize your Big Data investment.

Saturday, August 16, 2014

Tableau for Agile Oracle Essbase Financial Reporting

Oracle Hyperion Essbase is an established multidimensional database platform often used by accounting departments to model and store their company's financial data. Essbase comes with out of the box Oracle web reporting utilities to help you visualize your financials for management and also comes with integration with tools such as MS Office for reporting via Excel.

In the traditional desktop Excel world view, you end up passing lots and lots of Excel spreadsheets and statically built PDF reports around your organization and to your executives - a bit antiquated in this day and age, to say the least. You will however find that Excel, to its credit, is commonly used to build fairly advanced reports with Essbase using home grown Excel and VB programming. But there has to be a better way that does not involve building complex data warehousing and ETL and using outdated BI visualization tools.

Now Tableau is a fast growing and popular visualization BI solution that enables business analysts (without advanced technical expertise) to perform data discovery and build rich and sophisticated visualizations that can be shared more easily than traditional Excel spreadsheets. Tableau has emerged as a powerful replacement for Excel based reporting and as a challenger to the established enterprise BI platforms such as MicroStrategy and Cognos, to mention a few. Tableau fits well as an agile replacement for Excel reporting while allowing users to build very powerful next generation reporting and dashboards that outperform the traditional enterprise BI vendors in agility and visualization capabilities.

Tableau still has a way to go on the enterprise end, but it is coming on strong, and if you know how to deploy and implement Tableau Server you can build highly agile and visually rich enterprise grade BI solutions. For financial reporting, Tableau allows you to take your legacy Essbase reports and spreadsheets out of the dungeon and into the light of day by allowing you to build sophisticated dashboards that are easily accessible to all your executives across your organization via Tableau Server.

With Tableau you can just say no to having to build yet another data warehouse and complex ETL when architecting your business intelligence strategy. But be aware: Tableau can be used to extract data from Essbase directly using the built-in Tableau to Essbase connector, but say no to this approach. The Tableau Essbase connector will not work (needs another blog). We strongly suggest not using the Tableau Essbase cube connector for a number of reasons (not all Tableau related). This connector has many challenges. A hint - extract your Essbase data using the Essbase Excel plugin, mix with a little ETL and output to denormalized flat data structures. Say what? Yes, this approach rocks! Remember that Tableau is great at extracting dimensionality out of your data (that is one of its claims to fame, actually).

At Grand Logic, we have developed an elegant and straightforward approach to extracting data from Oracle Essbase for agile and efficient consumption by Tableau. This in turn can be used to build advanced financial reports and dashboards without a huge investment in data warehousing and ETL processing. Our approach to integrating Tableau with Oracle Essbase leads to a powerful solution that will leave your executives wanting more and free your accountants and financial analysts from building cumbersome to maintain Excel reports. Get your financial reporting and dashboarding in Tableau today for centralized access in an environment governed by one version of the truth. Put actionable and insightful data in the hands of your executives.

Are you also looking to invest in Big Data infrastructure and analytics? Essbase does not have to be an isolated island of data divorced from your Big Data initiatives. Read more on how you can integrate Essbase data with your Big Data analytics.

Looking to get your Essbase cube into a Big Data lake? Learn more about how you can integrate Essbase with Tableau and Apache Spark to supercharge your Tableau and Essbase connectivity.

Tableau and Essbase can be a great combination for building rich reporting and dashboards without the overhead and complexity of traditional data warehousing and BI. Get your financial data out of Essbase and into Tableau and into the hands of your executives and decision makers. Contact Grand Logic to learn more.

Wednesday, February 5, 2014

Machine Learning: The Brains Behind Big Data

The first round of the data revolution has focused around commoditizing computing and storage. Platforms such as Hadoop and NoSQL have helped to propel this and have enabled businesses to economically deploy more powerful scale out infrastructure than before. It has also changed and improved the way data warehousing and business intelligence is approached and managed. The storage and performance capabilities of Big Data have been a game changer. Traditional descriptive BI and reporting will never be the same. But this is just step one. The best is yet to come.

The industry is now going through a learning process with how to manage all this data at massive scales. Storing and managing more data is great, but people and businesses will get smarter about how much data to keep as it starts to hurt more (hurt the pocketbook). How much data you keep and mine will depend on statistically driven best practices and not just on data warehousing or how big your HDFS cluster is. The mainstreaming of Big Data has provided the muscle to store and process massive amounts of data at near linear scale, but we will not see the real value of all this Big Data storage and processing until machine learning and data science tools become more accessible (to the non-PhD data scientists among us) and mainstream, and businesses learn how to apply these tools and disciplines effectively.

Machine Learning will provide the brains to go along with the Big Data muscle. In the long run, businesses will decide how much data to keep around based on statistical measures and best practices as they grow to understand their data and their business better and build out their predictive and prescriptive analytics.

Sunday, October 20, 2013

JobServer and Mesos Make a Great Pair

We are happy to announce the release of JobServer 3.6 beta1 with support for Mesos clustering and distributed job processing. Release 3.6 is an early access release of JobServer with integrated support for Mesos. With this release of JobServer, you can now schedule and run jobs on a Mesos cluster of any size and configuration. Say goodbye to cron jobs!

JobServer has always had support for distributed job scheduling and processing and has always been a great replacement for cron. Now, with Mesos integration, JobServer takes this to the next level by incorporating support for dynamic resource management and reliability, leveraging all the advantages of Mesos. JobServer also brings powerful scheduling, reporting and monitoring features to Mesos environments. Distributed job scheduling and batch processing just got more interesting!

With this release you can track and manage jobs as they run across a dynamic and highly resilient cluster of servers. JobServer with Mesos allows you to run scripts and jobs across your cluster of servers and control how resources are allocated and utilized. If you are a Mesos user today, give JobServer a try and say goodbye to cron. If you are a JobServer user, get your compute resources under control with Mesos.

Download the beta release of JobServer v3.6 and tame your IT environment using all the advantages of Mesos and JobServer.

Grand Logic delivers software solutions that automate your business processes and tame your IT operations & Big Data analytics. Grand Logic delivers data and job automation software, along with Hadoop consulting services, that maximize your Big Data investment.

Friday, October 18, 2013

Mesos: The Datacenter is the Computer

The data center is the computer. The pendulum is swinging. Traditional cloud and virtualization level resource management in the data center is no longer good enough to efficiently manage the growing demands for computing services in the enterprise. The answer to this challenge of offering more compute and storage services more efficiently is solutions such as Mesos and YARN. These emerging cluster management platforms are the next evolution for fine grained and efficient resource management of your data center infrastructure and services. As the need for more processing and storage grows, solutions like YARN and Mesos take center stage.

Big Data, mobile, and cloud computing have driven a tremendous amount of growth and innovation, but the byproduct has been more and more computing infrastructure needed to service the growth and manage the explosion of data. This has especially been the case as we have moved to using more clustered commodity hardware and distributed storage. You now have start-ups and smaller companies managing complex multi-node computing infrastructure for things like Hadoop, real-time event streaming and social graphs, as well as for established core services like data warehousing, ETL and batch processing. All this has made it much more demanding to effectively manage and administer a dynamic hardware computing environment, and in many cases it has created isolated silos of resources dedicated to different tasks. For example, your Hadoop cluster is separate from your application services, database servers and legacy batch processing. This does not scale and it is not cost effective.

These silos have created inefficiencies within the data center and the enterprise environment. For example, if your Hadoop cluster of 10 nodes is running at maximum capacity only 70% of the time, what are those 10 nodes doing the other 30% of the time? The same can be said for the other services running in the data center. Unless you can treat your entire data center as one shared cluster of resources, you will have inefficiencies, and as the number of nodes and services you are managing grows, these inefficiencies will only increase. This is where solutions like Mesos can step in and give your applications and services one holistic view of your computing infrastructure. By using Mesos, you can reduce costs and more efficiently utilize the hardware and storage resources you already have, and you can grow more incrementally as more resources are needed.
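The arithmetic behind the 10-node example can be sketched in a few lines of Python. The sketch makes a simplifying assumption that a silo is fully idle whenever it is not at maximum capacity. Only the Hadoop figures (10 nodes, 70%) come from the example above; the `batch` and `app` silos and their numbers are hypothetical, added just to show how idle capacity adds up across silos.

```python
def idle_capacity(nodes, busy_fraction):
    """Average number of node-equivalents sitting idle in one silo."""
    return nodes * (1.0 - busy_fraction)


# (nodes, fraction of time busy) per silo.
silos = {
    "hadoop": (10, 0.70),  # from the example in the text
    "batch": (4, 0.25),    # hypothetical
    "app": (6, 0.50),      # hypothetical
}

hadoop_idle = idle_capacity(*silos["hadoop"])
total_nodes = sum(nodes for nodes, _ in silos.values())
total_idle = sum(idle_capacity(nodes, busy) for nodes, busy in silos.values())

print(f"Hadoop silo: {hadoop_idle:.1f} of 10 nodes idle on average")
print(f"All silos:   {total_idle:.1f} of {total_nodes} nodes idle on average")
```

Under these assumptions, the idle capacity in each silo cannot be used by any other silo, which is the inefficiency a shared cluster manager is meant to recover.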

Companies like Google, Twitter and Facebook are leading the charge to advance the state of the art for efficient data center and enterprise computing. Mesos is a great tool and platform to leverage to reduce costs and to improve the reliability and overall efficiency of your operational IT environment. Give Mesos a look. Cheers!

Thursday, August 15, 2013

Protecting your Hadoop Investment

The hype and buzz around Big Data in the tech industry are at astronomical levels. There are many factors driving this (both technical and human), but I won't get into that here. The fact is that Big Data (define it as you like) is here to stay and many organizations need to find their path to the wonderland of bottomless data storage and boundless analytical computing where no byte of data is ever thrown away and any question can be asked and answered about your data. Well, at least if you are Facebook or Google.

Hadoop is the leading contender to enable organizations to economically and incrementally take advantage of distributed storage and scalable distributed processing to tackle the Big Data challenges ahead. The days of buying expensive vertically scaling servers and expensive storage systems are over. Hadoop started from the humble beginnings of Map Reduce and distributed storage (HDFS) and has now expanded to touch and integrate with all corners of the enterprise computing fabric, from real-time business intelligence to ETL and data warehousing. These days, almost any company with some kind of database or software analytics solution has put the word "Big" in their title and offers some level of Hadoop integration. Nothing really bad about that, and it is great to see everyone gravitating to the Hadoop ecosystem as an open source standard of sorts for Big Data.

Hadoop presents a lot of potential to solve problems that in the past required much more expensive and proprietary systems. Note that Hadoop in many respects is no less complex than past and existing proprietary Big Data platforms (and is by no means free), as Hadoop has its own complexity challenges, such as many distributed hardware moving parts, and is more or less a loose collection of many open source projects. Hadoop has a lot of creative minds and companies driving its fast evolution. But it is not an out of the box plug and play solution nor a one size fits all solution by any stretch of the imagination. Hadoop does not come cheap by any measure, but with Hadoop you have more opportunity to grow your Big Data system as you go, with the potential for less vendor lock-in and more flexibility over what you pay for (note, I use the word potential here). The value you get out of Hadoop depends on your expectations and on your investment in people and training, along with key decisions you make along the way.

So how does an organization begin down the road of figuring out how Hadoop fits into their existing ecosystem and how much and how fast to invest in Hadoop? Let's see if we can walk through some common questions, challenges and experiences one would go through as they begin their Hadoop quest.

First you need to understand what makes Hadoop tick.
It is important to understand that out of the gate Hadoop does not necessarily invent anything that has not existed before in other products. There are some novel concepts and cool innovations in Hadoop, but overall it offers nothing altogether new. Fundamentally, Hadoop is about a few key concepts. It is founded on distributed computing and distributed storage using commodity hardware. Ultimately, Hadoop is about growing your data storage and processing in an incremental and economical way using largely open source technology and off the shelf hardware. Note that open source does not mean free, of course.

Okay, so what problem do we want to solve with Hadoop? Please don't say all of them.
One of the nice things about Hadoop is that organizations of any size can adopt it. You can be a small startup with a simple idea and run your Hadoop on a small cluster on Amazon, or you can be a larger enterprise and have massive clusters performing high-end processing, such as crawling and indexing the entire web. Hadoop can be used in a variety of situations, such as to reliably store large volumes of data on commodity storage, or it can be used for much more complex computing, ETL, NoSQL and analytical processing.

For larger organizations that are getting started with Big Data, it is vital to identify some key problems you want solved with Hadoop and that might fit and integrate well with existing legacy systems. Hadoop is particularly good at being a holding area for unstructured data like web or user logs that you might want to keep in raw format for later analysis and auditing, for example. What is typically important is to start small and solve some specific problems on specific data sets and then expand your application of Hadoop as you go. This includes getting accustomed to the many programming and DSL packages that can be used to process Hadoop data.

Hey, in a Big Data universe we never throw anything away.
Some of the talk circling around Big Data often mentions how the typical application of Hadoop is to always store everything forever. Obviously this is not practical. Now, many vendors that are providing software and hardware for Hadoop would love for you to try to do this, but the reality is that you still need to understand your data limits and have clear aging and time to live policies. Hadoop does let you scale your storage out to petabytes, potentially, but there is no free lunch here. Also, a critical aspect of this is understanding the format you store your data in within Hadoop. Again, you hear a lot of talk about storing all your data in "raw format" so you can have all the details in order to extract deep information from your data in the future. While this sounds great in theory, it is not practical in most cases. In reality, you can keep some data in raw format, but you must typically transform your Hadoop data into other formats besides just unstructured HDFS sequence files, for example. Structure does matter as you get into more complex analytics in Hadoop. Storing your data in HDFS also often means transforming it into semi-structured column stores for use by tools such as Hive and HBase and other query engines, for better performance. So structure matters, and you should expect to have your data stored in Hadoop in possibly multiple formats, or at least transformed via Hadoop based ETL into formats other than the "raw" acquisition format. This all adds up to more and more storage requirements. So make sure you understand the math to properly size your Hadoop storage needs.
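As a rough idea of what that sizing math can look like, here is a back-of-the-envelope sketch in Python. Every input value is a hypothetical placeholder, and the `disk_needed_tb` helper is made up for this illustration. The one grounded number is the replication factor of 3, which is the HDFS default. The sketch ignores compression, temporary job output and free-space headroom, so treat it as a starting point rather than a sizing method.

```python
def disk_needed_tb(daily_ingest_tb, retention_days, derived_ratio, replication=3):
    """Estimate physical disk needed, in TB, once retention reaches steady state."""
    # Data on hand before the aging / time-to-live policy starts deleting it.
    kept = daily_ingest_tb * retention_days
    # Extra copies of the same data in transformed formats, as a fraction of `kept`.
    with_derived = kept * (1.0 + derived_ratio)
    # HDFS stores every block `replication` times.
    return with_derived * replication


# Hypothetical inputs: 0.5 TB/day, kept for a year, derived formats add 50%.
estimate = disk_needed_tb(daily_ingest_tb=0.5, retention_days=365, derived_ratio=0.5)
print(f"Estimated physical disk: {estimate:.2f} TB")
```

Even with these modest made-up inputs, about 180 TB of acquired data turns into several times that on disk, which is why retention policies and the number of stored formats matter so much.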

Now this software is open source which means mostly free, right?
Obviously we have all learned by now that open source does not necessarily mean free. Red Hat, as an example, has a pretty good business around open source and they are quite successful at making a profit. Hadoop vendors are no different. There are several well-funded start-ups that have Red Hat-like business models around Hadoop, not to mention all the big boys trying to retrofit their existing Big Data solutions to be Hadoop friendly. None of them are free, but they are all different from each other. And it is important to understand each Hadoop vendor's strengths and weaknesses and where they are coming from. The vendor's history does matter for a lot of reasons that I will discuss in a later post.

Now, in theory you could go it alone and use Hadoop completely free - just download most of the Hadoop packages from Apache (and a few other places). For example, I have downloaded and installed versions of Hadoop from the Apache Foundation and have been able to run basic Map Reduce and HDFS jobs on small clusters - all for free and without going through any Hadoop vendors. You can also use the community versions of the various Hadoop distributions from the major Hadoop vendors. This can work, but you are on your own, and how feasible this approach is depends on who you are and how savvy your technical staff are. It is also important to understand how the various Hadoop distributions and players differ from each other and how much you are getting "locked in" with each Hadoop vendor. The retro-fitted Hadoop vendors (as I call them) have a lot more polish and savvy when they pitch Hadoop to you, while some of the Hadoop startup vendors have varying degrees of proprietary software embedded in their Hadoop distributions. It is critical to understand these facts, and it is important to consider how much you are willing to build on top of Hadoop yourself vs relying 100% on your Hadoop partner. These are important considerations that can sometimes get lost in internal management jockeying over who will be the Big Data boss. Vendor lock-in is very important to understand, along with clearly planning for sizing, capacity and long-term incremental growth of your cluster.

This all leads to understanding the cost of Hadoop as you set expectations over what problems you want your Hadoop cluster to solve from day one. Sizing your Hadoop cluster for storage, batch computing, real-time analytics/streaming, and data warehousing must be considered. How you capacity plan your storage, HDD spindles, and CPU cores is a critical decision as you plan the nuts and bolts of your Hadoop cluster. Your Hadoop partner/vendor can help you with this sizing and planning, but again here, each vendor will approach it differently depending on who they are and who you are (how deep your pockets are). You have to be smart here and know what is in your best interest long-term.

Your Hadoop cluster is not an island.
It is vital to consider how your Hadoop cluster will fit in with your current IT environment and existing data warehousing and BI environments. Hadoop will typically not totally replace your existing ETL, data warehousing and BI systems. In many cases, it will live alongside existing BI systems. It is also vital to understand how you will be moving data efficiently into your Hadoop cluster and how much processing and storage is needed to put data into intermediate formats for optimal performance and efficient consumption by applications. These are critical questions to answer in order to get your Hadoop cluster running efficiently to effectively feed downstream systems.

You mean my Hadoop cluster does not run itself?
One underestimated area concerning Hadoop is planning for the operations and ongoing management of your Hadoop cluster. Hadoop is good technology, but it is fast evolving and has many moving parts, both at an infrastructure level (lots of nodes and HDDs) and from a software perspective (lots of software packages that are fast evolving). This makes running, monitoring and upgrading/patching Hadoop a non-trivial task. For example, many of the Hadoop vendors offer both open source and proprietary solutions for managing and running your clusters. This obviously requires your operations and production IT staff to be included in the planning and management of your clusters.

Some other important questions and considerations as you get started with Hadoop.

How will multi-tenancy and sharing work if more than one group is going to be using your cluster?
Should you have one or a few big Hadoop clusters, or many small clusters?
Understand your storage, processing, and concurrency needs. Not all Hadoop schedulers are created equal for all situations.
Do you need or want to leverage virtualization and/or cloud bursting?
Choose your hardware carefully to keep costs per TB low. How to manage TB vs CPU/core is important.
Understand what you need in your edge nodes for utility and add-on software.
Plan your data acquisition and export needs between your Hadoop cluster and the rest of your ecosystem.
Understand your security needs at a data and functional level.
What are your up time requirements? Plan for rolling patches and upgrades.

Maybe I should have stated this in the beginning, but the reason I called this blog Protecting your Hadoop Investment is that many organizations enter into this undertaking without a clear understanding of:

Why they are pursuing Big Data (other than it is the hot thing to do).
How Hadoop differs from past proprietary Big Data solutions.
How it can fit alongside existing legacy systems.
How to ultimately manage costs and expectations at both a management and technical level.

If you do not understand these points, then you will waste a lot of time and money and fail to take effective advantage of Hadoop. So, strap in and enjoy your Hadoop and Big Data adventure. It will be a journey as much as a destination and it will transform your organization for the better if you plan appropriately and enter into it with your eyes wide open.

Thursday, July 25, 2013

JobServer 3.4.28 - Isolated JVM Containers

We are happy to announce the release of JobServer 3.4.28 which adds a number of new features for administrators along with supporting the latest version of Google Web Toolkit and expanded remote management APIs.

With this release, JobServer now supports expanded remote web services programmatic APIs. Also included in this release is the capability to run distributed jobs under customizable Linux/Unix userspace accounts on a job-by-job basis, which gives administrators fine-grained control over how they run their jobs. This allows users to run jobs inside isolated JVMs in a more granular fashion.

It has always been our focus to make JobServer the most developer and IT friendly scheduling and job processing platform on the planet. We are proud of our focus on taking customer and developer feedback to continuously make JobServer the best scheduling and job processing engine around. JobServer tames your job processing and scheduling environment in a way that is a joy for Java developers to customize upon while providing powerful web UI management and administration features for business users and IT operations administrators.

Download and test drive JobServer 3.4.28 today and learn more about JobServer's powerful developer SDK, soafaces, which makes it easier to extend and customize JobServer and to develop custom jobs and backend automated services.

Grand Logic delivers software solutions that automate your business processes and tame your IT operations & Big Data analytics. Grand Logic delivers data and job automation software and Hadoop consulting services that maximize your Big Data investment.

Saturday, June 1, 2013

Big Data is More Than Correlation and Causality

There is no discounting that the Big Data movement is getting a lot of attention from all avenues of business and technology. Large scale computing has been around for decades, since the days of super computers, and has been brought to the forefront of late by the high flying internet companies. This has been driven in part by significant advances in the availability of commodity hardware, open source distributed computing software, cloud computing, and virtualization among other things.

A lot of the debate as to the value and benefits of Big Data is centered around how it can help companies analyze large data sets to make marketing-type decisions, such as recommending what movie or product you should buy, and thus improve the bottom line of these businesses. There are also other applications, such as the analysis of vast volumes of sensor or transactional data in order to find patterns using machine learning. The possibilities for applying Big Data abound for analyzing both structured and unstructured data in order to extract information and improve marketing and overall business decision making.

Correlation vs Causality
One common debate about Big Data is the effectiveness of the analytics applied in Big Data solutions, and whether it can really discover answers to questions or is just better suited to finding correlations and not necessarily to identifying precise causality. These debates are good discussions to have, and in general I think Big Data can serve many purposes, from finding correlations to solving very specific problems from a wide spectrum of data sources. The ability to extract value from Big Data is driven in part by the volume of data available and by applying the right machine learning algorithms. However, I believe there is a much bigger value to be gained from the Big Data computing movement than just correlations or sifting through transactions to calculate some metric or finding a needle in a haystack from petabytes of data.

Insights are not Enough
Extracting insights from vast volumes of structured and loosely structured data has many applications, but the ultimate application of this is enabling computing systems to make smart and intelligent decisions with less and less human involvement. This is what leads to lower costs and improved productivity and what has historically been part of the human evolution where it relates to technology. We have evolved over the decades to have machines do more work for us, so the smarter our machines get and the more autonomous they get the more we evolve as a technology driven society.

Automation and Intelligence
Ultimately Big Data can help us go beyond just a discussion around finding correlations or summarizing metrics to generate visually captivating reports. The ultimate benefit business can gain from Big Data is no different from what it has always been in the past with other computing and communications technology advances. It is about automation in its simplest form, and in the most advanced form it is about enabling software and computers to power artificial intelligence to enable system autonomy. The smarter and more independent our systems are, the more we advance and the more efficient business becomes. This drives greater productivity and effectiveness in all aspects of business. This will, for example, allow us to build power plants that run themselves much more efficiently, to build computers like IBM Watson that can make human-like decisions, and to build automation software like Siri and Google Now that can understand what we want and deliver the right information to us at the exact time we need it. So Big Data is many things, but ultimately it will turn our computers and data into information that will automate all aspects of our lives and make business more efficient and productive.

The Time for Artificial Intelligence is Now
With advances in distributed computing, networking, and storage, the time has come for AI to be at the heart of what Big Data is all about. AI has never achieved many of the sci-fi type capabilities we have all grown up watching on TV and in movies. Big Data will be what allows AI to achieve the potential we have all dreamed it could have, and this will make many things we only dreamt of possible.

Tuesday, May 14, 2013

Hadoop the New 'T' in ETL

ETL is a common computing paradigm used in a variety of data movement and data management scenarios. As demand for more insight into business data has grown, ETL has been used to move more data from operational data stores into OLAP and data warehousing environments. This has expanded the need for analytics and other solutions that rely on data being reconstituted into easier-to-consume forms or into data models that are more efficient for solving specific problems.

So nothing special going on here, but as data volumes have grown and sources of data have exploded, the transformation part of ETL (the "T") is becoming more of a challenge, especially as organizations demand more near real-time analytics and up to date information. Transforming the volumes of operational data is becoming a computing bottleneck and often limits what you can do with data after it has been transformed and loaded into downstream data marts. See a typical ETL data flow diagram below.

Big Data to the Rescue
With the evolution of big data and Hadoop, new tools have been brought to bear that can help in the overall ETL computing process. However, with Hadoop, the ETL model needs to be revisited. Hadoop can bring tremendous computing resources to more efficiently transform data into target models. While Hadoop can serve as part of your overall processing fabric and can be leveraged directly for OLAP and itself be used for data warehousing (e.g. HBase data store), it can also serve as an intermediate staging area that can be used to populate traditional relational data marts.

Using Hadoop in this way allows it to be used as an intermediate store for data until it can later be transformed into target models. We can accomplish this "load first" approach using Hadoop, by changing the ETL model around a bit. Instead of extracting and transforming data first, we can instead extract and load data into Hadoop storage, for staging, and then take full advantage of the Hadoop compute infrastructure to transform (using Map Reduce, Impala, Drill…etc) the data into target models that can feed traditional relational data marts and OLAP engines. See diagram for example:

Hadoop for Transformation
This essentially allows organizations to use Hadoop as the transformation platform, letting developers perform more complex transformations that were not practical in the normal ETL universe. So think of Hadoop as the new supercharged "T" in the "ELT" paradigm, where data is moved as efficiently as possible from operational stores and loaded ("L") into HDFS (and HBase or Cassandra) as fast as possible. Then the "T" can be performed within the Hadoop ecosystem. This allows Hadoop to be a powerful intermediary layer that can drive new analytics and allow existing analytics to keep up with the deluge of data. This also allows existing OLAP and data warehouses to continue to consume data out of Hadoop for existing analytics.
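As a toy illustration of the "load first, transform later" ordering, here is a small Python sketch. It is only a stand-in: a plain list plays the role of the HDFS staging area, the function names and sample records are made up for this example, and a real deployment would run the transform step with Map Reduce, Impala, Drill or similar.

```python
from collections import defaultdict


def extract(source_rows):
    """Extract: pull records from the operational store untouched."""
    return list(source_rows)


def load(staging, rows):
    """Load: append the raw records to staging (stand-in for HDFS)."""
    staging.extend(rows)


def transform(staging):
    """Transform: runs after loading, totalling amounts per customer."""
    totals = defaultdict(float)
    for line in staging:
        customer, amount = line.split(",")
        totals[customer] += float(amount)
    return dict(totals)


staging_area = []
# E and L happen first, with no reshaping of the data on the way in.
load(staging_area, extract(["acme,10.0", "globex,5.5", "acme,2.5"]))
# T happens later, inside the staging platform, to feed a data mart.
target_model = transform(staging_area)
print(target_model)  # {'acme': 12.5, 'globex': 5.5}
```

Because the raw records stay in staging, the same data can be transformed again later into a different target model without going back to the operational store.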

So let us start getting used to the concept of "ELT" as the new big data cousin of ETL. Hadoop is more than just a historical archive or dumping ground for unstructured data. It can be a powerful transform computing layer that can drive better data warehousing for new and existing analytics solutions.

Friday, April 26, 2013

Data In and Out of the Hybrid Cloud

The continued adoption of, and accelerating interest in, cloud computing has created an interesting problem for the large enterprise. Most large enterprises have existing private networks and intranets. As parts of the organization begin to adopt public and private clouds, there is a challenge of moving data in and out of these clouds.

Data movement is more challenging for a number of reasons, including security and network accessibility. For example, you usually can't walk up to these clouds and load a tape or hard drive. And bandwidth is oftentimes more restricted than that between internal networks. Applications you have running in your private network may not be able to talk directly to the cloud without going through web services or other new networking schemes.

Then there is the question of cost. When does a business decide it is time to bring back data from the cloud? There is a point where keeping certain data in the cloud could become cost prohibitive. While the cost of cloud computing and storage is always going down and newer services pop up all the time (like Amazon Glacier for example), this issue is not going away.

Also, for Big Data type computation and data storage, at what point does keeping your data running in Amazon EMR or stored in S3 become prohibitively expensive? All these questions are important for organizations to understand as the adoption of cloud computing and Big Data analytics accelerates. There is no simple answer of course, but it is important for organizations to consider these questions from both an IT and a financial perspective.

Saturday, April 20, 2013

Databases are Cool Again

It is definitely interesting what is happening in the database space these days. It is good to see the NoSQL and NewSQL folks spark a fire under the traditional relational vendors. This is the only way to inspire innovation both in the commercial and open source space. To a large extent, the established relational players were slow to jump on the cloud, and even to this day they are not moving as aggressively as they need to in order to reclaim the cloud database market from the NoSQL and NewSQL upstarts.

Fundamentally, the scale-out capabilities of traditional RDBMS engines still don't hit the sweet spot developers and cloud operations people need these days. I do expect the market will consolidate somewhat in the next several years as there are just too many players at the moment, especially on the NoSQL side. But I expect there to remain a large selection of NoSQL engines over time, as many of the NoSQL players target specialized areas, so it is definitely not one size fits all like it has historically been with relational databases. For example, many of the NoSQL engines have made deliberate engineering trade-offs in their products, such as in their storage models, consistency, replication, aggregation capabilities, and scale-out. If you need a NoSQL engine with strong aggregation functions you might choose MongoDB, but if you need something that scales out writes and data center replication you might go with Cassandra. So, in the long term I do not see a single NoSQL engine that can rule them all.

Tuesday, April 16, 2013

Cloud Job Scheduler on EC2

Grand Logic is pleased to announce a new release of our JobServer Cloud edition product. JobServer Cloud edition allows our customers to access and use the powerful job scheduling, job processing, workflow and SOA messaging features available in JobServer from a cloud environment. JobServer now supports deployments on Amazon EC2, allowing our customers to lower their IT costs and free themselves to focus on their core applications. If you are using EC2 to host your applications, and you are in need of a job scheduling and processing solution, then JobServer is a perfect choice.

JobServer Cloud delivers all the same great features and capabilities found in our core JobServer software and can now be hosted in the cloud. This frees customers from the IT burden of buying and maintaining hardware and installing and managing their own IT environment for JobServer. With the cloud edition of JobServer, you are freed from dealing with upgrades, maintenance tasks and hardware issues. We can add more job processing capacity, quickly, as you need it. You just need to focus on building and deploying your jobs and Java Tasklets. We will ensure that your JobServer environment is well maintained, managed, backed up and running efficiently. We will alert you if we notice performance issues or slow running jobs and processes, or if more capacity is needed. Daily and weekly reports are delivered that provide detailed job scheduling and processing statistics. Contact our support team to get set up with a fully managed instance of JobServer.

With our fully managed JobServer Cloud solution, there is no hardware or software to install! With JobServer Cloud, your environment will be deployed and run from the Amazon cloud with secure connectivity to your private network using Amazon VPC (Virtual Private Cloud). You can access your JobServer instance securely to run and manage your jobs and apps with full access to your private corporate network. Many customers also need their JobServer environment to connect and access local IT systems and services within their private corporate network. With Amazon's VPC (Virtual Private Cloud) we can securely bridge between a company’s existing IT infrastructure and your JobServer Cloud environment. Amazon VPC enables enterprises to connect their existing infrastructure to a set of isolated JobServer compute resources via a Virtual Private Network (VPN) connection, and to extend their existing management capabilities such as security services, firewalls, and intrusion detection systems to include their JobServer resources that are in the Amazon cloud.

If you want to install and manage your own JobServer instances, then download JobServer today and install on your EC2 environment to start scheduling and processing jobs. Start with one EC2 instance or scale JobServer to run on hundreds of EC2 instances. JobServer scales easily and effectively on EC2 to allow you to run thousands of jobs. JobServer can scale to meet your needs by taking full advantage of your Amazon cloud environment.

About Grand Logic
Grand Logic is dedicated to delivering software solutions to its customers that help them automate their business and manage their processes. Grand Logic delivers automation software and specializes in mobile and web products and solutions that streamline business.

Saturday, March 9, 2013

Putting NoSQL in Perspective

Deciding between a NoSQL database or a relational database system is about understanding the trade-offs that led to the creation of NoSQL to begin with. NoSQL systems have advantages over traditional SQL databases because they give up certain RDBMS features in order to gain other performance, scalability and developer usability capabilities.

What NoSQL gives up (this varies by NoSQL engine):

Relationships between entities (like tables) are limited to non-existent. For example, you usually can't join tables or models together in a query. Traditional concepts like data normalization don't really apply. But you still must do proper modeling based on the capabilities of the particular NoSQL system. NoSQL data modeling varies by product and whether you are using a document vs column based NoSQL engine. For example, how you might model your data in MongoDB vs HBase varies because each solution offers significantly different capabilities.
Limited ACID transactions. The level of read consistency and atomic write/commit capabilities across one or more tables/entities varies by NoSQL engine.
No standard domain language like SQL for expressing ad-hoc queries. Each NoSQL has its own API and some of the NoSQL vendors have limited ad-hoc query capability.
Less structured and rigid data model. NoSQL typically forces/gives more responsibility at the application layer for the developer to "do the right thing" when it comes to data relationships and consistency. Think of NoSQL as a schema on read instead of the traditional schema on write.
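To illustrate the "schema on read" idea from the last point, here is a minimal Python sketch. The documents, field names and the `read_user` helper are made up for illustration and are not tied to any particular NoSQL product; the point is only that the store accepts differently shaped records and the application applies structure when it reads them.

```python
import json

# Stored as-is: the two documents do not share the same fields.
raw_documents = [
    '{"user": "ann", "age": 31}',
    '{"user": "bob", "email": "bob@example.com"}',
]


def read_user(doc_text):
    """Apply the schema at read time: coerce types, default missing fields."""
    doc = json.loads(doc_text)
    return {
        "user": str(doc["user"]),
        "age": int(doc["age"]) if "age" in doc else None,
        "email": doc.get("email"),
    }


users = [read_user(d) for d in raw_documents]
for user in users:
    print(user)
```

With schema on write, the second document would have been rejected or forced into shape at insert time. Here that responsibility sits in the application code, which is exactly the extra burden on the developer described above.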

What NoSQL offers:

Easier to shard and distribute the data across a cluster of servers. Partly because of the lack of data relationships, it is easier to shard and distribute data across a cluster and to add capacity more incrementally and horizontally. This can give much higher read/write scalability and fail-over capabilities, for example.
Can more easily deploy on cheaper commodity hardware (and in the cloud) and expand scalability more incrementally and economically.
Don't need as much up-front DBA type of support. But if your NoSQL gets big you will spend a lot of time doing admin work regardless.
NoSQL has a looser data model, so you can have sparser data sets and variable data sets organized in documents or name/value column sets. Data models are not as hard wired.
Schema migrations can be easier, but they put the burden on the application layer (developer) to adjust to changes in the data model.
Depending on what type of application you are building, NoSQL can make getting started a little easier since you need less time planning for your data model. So for collecting high velocity and variable data, NoSQL can be great. But for modeling a complex ERP application it may not be such a great fit.
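The sharding point at the top of this list can be sketched in a few lines of Python. This is a deliberately simplified hash-modulo scheme with made-up function names; real engines use more elaborate approaches (Cassandra, for example, uses consistent hashing) so that adding a node does not reshuffle most keys.

```python
import hashlib


def shard_for(key, num_shards):
    """Map a record key to a shard number using a stable hash."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


def distribute(keys, num_shards):
    """Group keys by the shard (server) that would own them."""
    shards = {i: [] for i in range(num_shards)}
    for key in keys:
        shards[shard_for(key, num_shards)].append(key)
    return shards


placement = distribute(["user:1", "user:2", "user:3", "user:4"], 3)
for shard_id, keys in placement.items():
    print(shard_id, keys)
```

Because each record is located by its key alone, with no joins to other records, any server can own any record. That independence is what makes this kind of horizontal distribution so much easier than with a normalized relational schema.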

Like with most things, there is no silver bullet here with NoSQL. There are many different products available these days and each has its own particular specialties and pros/cons. In general, you should think of NoSQL as a complementary data storage solution and not as a complete replacement of your relational/SQL systems, but this will depend on your applications and product functionality requirements. For example, whether you are working in an analytics environment or an OLTP setting can greatly affect how you use NoSQL and which specific NoSQL engine you choose. Keep in mind you may end up using more than one NoSQL product within your environment based on the specific capabilities of each - this is not uncommon.

Relational databases are also evolving. For example, new hybrid NoSQL-oriented storage engines are coming out based around MySQL. Also, products like NuoDB and VoltDB (what some are calling NewSQL) are trying to evolve relational databases beyond the vertical scaling and legacy storage computing restrictions of the past by using a fundamentally different architecture from the ground up. Keep your seat belts fastened: the database landscape has not been this innovative and fast moving in decades.

Tuesday, March 5, 2013

SQL and MPP: The Next Phase in Big Data

Over the past couple of years we have all by now heard about the Big Data movement. Two key enablers in this remaking of analytics, data warehousing and general computing have been the NoSQL database technology movement and the emerging Hadoop compute stack. While not directly related to each other, both NoSQL and Hadoop have become associated with the rapidly accelerating Big Data revolution as more companies look to manage larger and larger data sets more effectively and economically. NoSQL has been the new kid on the block in the database space by attempting to take applications and data to the promised land of web-scale computing where traditional relational databases have fallen short. Over the past decade, SQL and relational database technology have failed to effectively keep up with developer needs and the scaling demands of a new generation of social and data heavy applications, and this has opened the door to a different approach from the NoSQL camp. Hadoop is in the same position, promising to deliver analytics and offline batch computing power that was not practical or cost effective before the emergence of HDFS and Map Reduce, and that has, for the most part, only been found in expensive and proprietary analytics products.

As with many proclaimed revolutions such as Big Data, this is just the tip of the iceberg, as they say. Like with most transformations in technology, there is more to come as these technologies penetrate into more industries and gain wider adoption and broader acceptance by the open source community and the established heavy hitters. The next wave of Big Data technology will push the edges into other domains and go beyond the offline computing boundaries of HDFS and Map Reduce. While SQL and relational database centered analytics have taken a back seat lately because of the emergence of NoSQL, SQL as a domain language will get an uplift with the emergence of the next Big Data wave as we move past the basic offline Map Reduce paradigm and look towards more real-time computing engines that can enable MPP (massively parallel processing) computing. This will allow IT organizations to continue to benefit from the low cost of commodity hardware and the horizontal scaling benefits brought about by Map Reduce and HDFS, now generalized further for real-time analytics.

While NoSQL has established itself as a technology that is here to stay, the traditional relational database paradigm is not gone by any stretch and still provides an invaluable ad hoc query function to analysts and developers alike. NoSQL products like Cassandra, HBase and MongoDB (to mention a few) solve a unique problem and are becoming key foundations of any web-scale computing stack, whether for online CRUD apps or for offline analytics. But that does not eliminate the need for, or diminish the power of, relational SQL engines and SQL as a powerful expressive domain language. NoSQL is not a silver bullet but can be a powerful complementary solution to traditional relational data storage models. The NoSQL folks have used the classic engineering trade-off where they have exchanged certain features found in relational databases to gain greater horizontal scalability. I will not get into the details of this, but I do not want to oversimplify what NoSQL has done. At the heart of the trade-off is eliminating relationships between data entities for the benefit of allowing greater horizontal scalability. NoSQL also gives the developer a more flexible "on-read" schema model that has its benefits.

So what does this all mean? Well, expect NoSQL and the current 1.0 Hadoop stack to continue to mature and become more mainstream - that is a no-brainer. But for the next phase I see SQL (for ad hoc querying) and real-time MPP becoming part of this Big Data fabric, and this will bring back the ad hoc capabilities of relational databases but now with the horizontal scaling and cost effectiveness found in HDFS and Map Reduce.

You can see this next phase is already happening by just observing all the commercial products rushing to extend their traditional analytics engines to work on top of Hadoop and all the investment going into taking Hadoop beyond its current offline Map Reduce roots. They vary from open source next generation MPP platforms, to cloud providers offering analytics as a service, to traditional data warehouse vendors extending their products to run on top of Hadoop, to next generation relational database start-ups. Here is a sample of some of the players and products to watch:

Hadoop 2.0 Players

Cloudera - Impala
Cloudera is leading the charge to create a next generation open source MPP platform that builds on the core components of Hadoop (HDFS, Zookeeper, MR...etc) to enable real-time analytics of Big Data. The initiative is open source but primarily driven (at least for now) by Cloudera. This is also partly a recognition that Map Reduce and tools like Hive are fine for certain offline analytics and processing but are not a complete solution for real-time reporting and analytics.

MapR - Apache Drill
This is a similar project to Impala but channeled through the Apache organization and primarily driven by MapR (a Cloudera Hadoop competitor).

Hadapt
Vertical solution for Hadoop for organizations wanting a more SQL friendly interface to their Hadoop data sources.

Datameer
Another Hadoop vertical player that is trying to make analytics and reporting easier for the Hadoop stack.

Cloud Players

Google - Big Query
This is Google's cloud service that combines a distributed data store with a powerful SQL-like ad hoc query engine (based on Google's Dremel technology).

Amazon - RedShift
Amazon's service to help businesses build data warehouses in the cloud more economically, with an ad hoc SQL query interface. Partially based on technology from ParAccel.

Old School Players

IBM - Netezza
While traditionally focused on enterprise data warehousing, IBM is evolving their stack to fit and play nice with Hadoop and other Big Data solutions.

HP - Vertica
HP's Big Data play. Like IBM and Teradata, HP acquired its way into the Big Data space.

Teradata - Aster Data
Teradata is a true old school player, dating back to when the data world centered only on relational databases. Their acquisition of Aster Data changed that.

Next Generation SQL Players to Watch

NuoDB
NuoDB is the new kid on the block promising a new way to scale and build relational databases in the cloud. Their approach is more or less based on a peer to peer model that allows them to scale out (as they claim) while still delivering on the traditional capabilities of relational databases such as read consistency and ACID transactions. While NuoDB is more focused on OLTP type processing, its claim that it can scale horizontally while supporting a SQL relational model makes it potentially powerful for real-time analytics as well.

VoltDB
Another new age relational database engine that delivers horizontal scaling while retaining SQL capabilities. Differs from NuoDB by taking a caching approach to meet the scaling challenge.

For the next wave of Big Data innovation, the landscape is rapidly changing with both old and new industry players getting into the game. Big Data will no longer be limited to offline and long latency based analytics processing. The lines between OLTP, OLAP and Enterprise Data Warehousing are blurring as offline computing, real-time analytics and data storage models evolve and converge. Expect better technology options and improved cloud scalability at a lower cost of ownership as the competition heats up and the next evolution of Big Data matures. Pick a horse and run with it. Stay tuned.

Wednesday, February 27, 2013

Grand Logic a Featured Cloudera Partner

We are proud to be selected by Cloudera as part of their featured partner list and to join the largest and fastest growing Apache Hadoop ecosystem. Grand Logic has been a strong proponent of Apache Hadoop and the potential of Big Data computing.

Grand Logic has integrated support for Hadoop and other Big Data technologies into our flagship product, JobServer. This has brought enterprise job processing and automation to Big Data computing and converged traditional business automation and job processing with Big Data analytics and computing. There are no islands or barriers here. JobServer allows enterprises of all sizes, from startups to the Fortune 500, to converge their back office business processing, SOA assets, and Big Data computing infrastructure such as MapReduce, BigQuery, Hive queries, Impala queries, Pig…etc, all under one job scheduling management platform.

Download and test drive JobServer today and learn more about JobServer's powerful developer SDK, soafaces, that makes it easier to extend and customize JobServer and to develop custom jobs (Hadoop jobs, SOA jobs, ETL jobs, BigQuery jobs, Hive jobs…etc) and back-end automated services.

Grand Logic delivers software solutions that automate your business processes and tame your IT operations & Big Data analytics. Grand Logic delivers data and job automation software and Hadoop consulting services that maximize your Big Data investment.

Tuesday, December 18, 2012

Turning Scripts into GUI Web Applications

How would you like to be able to turn any Linux bash script, command line program or Windows batch script into a GUI driven web application that any business user can invoke and manage? And all in just a few clicks. Well, JobServer provides some great features to make this happen. JobServer allows you to embed or associate any script or command line program with a Tasklet that can then be run in a server-side job. The job can be configured to be invoked and run by any business user from JobServer's web UIs. The script can be manually launched by the user (and customized on the fly) from the GUI or scheduled to run at a later time or at a set frequency.

For example, an IT administrator or developer can use JobServer to embed their Linux script into JobServer and then expose it as a user interface based web application for any business user to manually run, schedule, monitor and track. The administrator can easily customize and parameterize the job/tasklet to allow custom input parameters that the business user can pass in directly from the web GUI.
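As a rough illustration of the idea, and not of JobServer's actual Tasklet API, here is a minimal Python sketch of a server-side wrapper that runs a command with user-supplied parameters and records the outcome. The function name `run_script` and the returned field names are hypothetical.

```python
import subprocess
import sys

def run_script(command, params):
    """Run a command with user-supplied parameters and capture the outcome.

    Hypothetical helper for illustration; it is not part of JobServer.
    """
    result = subprocess.run(
        command + [str(p) for p in params],  # parameters are passed as argv
        capture_output=True,
        text=True,
    )
    return {"exit_code": result.returncode, "stdout": result.stdout.strip()}

# Stand-in for an administrator's script: it echoes its first argument.
report = run_script(
    [sys.executable, "-c", "import sys; print('hello', sys.argv[1])"],
    ["world"],
)
```

Passing parameters as a list, rather than building a shell string, keeps user input from being interpreted by a shell.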

Give JobServer a try, and turn any Linux or Windows script or command line program into a GUI application that can be run and tracked by business users. What is also great about this is that the developer or IT administrator has full control over the scripts being run and can throttle capacity and disable/enable the scripts at any time. They can also track who has been running the scripts/jobs. With just a few clicks and some cut and paste you can turn your scripts into GUI web applications!

Download and test drive JobServer now and learn more about JobServer's powerful developer SDK, soafaces, that makes it easier to extend and customize JobServer and to develop custom jobs and back-end automated services.

Grand Logic delivers software solutions that automate your business processes and tame your IT operations & Big Data analytics. Grand Logic delivers data and job automation software and Hadoop consulting services that maximize your Big Data investment.

Friday, December 7, 2012

JobServer 3.4.14 for Oracle RAC

We are happy to announce the release of JobServer 3.4.14 which brings expanded enterprise features to JobServer's scheduling engine and job processing platform. JobServer has always been designed from the ground up for massive job scheduling and processing scalability while being highly resilient in the face of hardware, network and database interruptions. Reliable, repeatable, reportable, and measurable job scheduling, processing and management has always been the centerpiece of our focus with the JobServer platform.

With this release, JobServer now supports Oracle RAC 11g and allows for hot failover at the database layer. By enabling Oracle SCAN configuration, JobServer can leverage Oracle RAC's dynamic failover and database routing capability allowing JobServer to continue to access critical database data and transactions during critical job scheduling and job processing functions.

"We are excited about our Oracle RAC support in JobServer as this brings another level of enterprise fault tolerance into the JobServer Platform". JobServer tames your job processing and scheduling environment in a way that is a joy for Java developers to develop and customize upon while providing powerful management and administration features for business users and IT operations administrators.

Download and test drive JobServer 3.4.14 now and learn more about JobServer's powerful developer SDK, soafaces, that makes it easier to extend and customize JobServer and to develop custom jobs and back-end automated services.

Grand Logic delivers software solutions that automate your business processes and tame your IT operations & Big Data analytics. Grand Logic delivers data and job automation software and Hadoop consulting services that maximize your Big Data investment.

Wednesday, September 26, 2012

BigQuery: Data Warehouse in the Clouds

There are a lot of changes occurring these days with the Big Data revolution, such as cloud computing, NoSQL, columnar stores, and virtualization, just to mention a few of the fast moving technologies that are transforming how we manage our data and run our IT operations. Big Data, powered by technologies such as Hadoop and NoSQL, is changing how many enterprises manage their data warehousing and scale their analytics reporting. Storing terabytes of data, and even petabytes, is now within the reach of any enterprise that can afford to spend the money on potentially hundreds or thousands of commodity cores and disks to run parallel and distributed processing engines like MapReduce. But is Hadoop the right fit for everyone? Are there alternatives, especially for those that want more real-time big data analytics? Read on.

A Little Background on Hadoop

With Hadoop and many related types of large distributed clustered systems, managing hundreds if not thousands of CPUs, cores and disks is a serious system administration challenge for any enterprise big or small. Cloud based Hadoop engines like Amazon EMR and Google Hadoop make this a little easier, but these cloud solutions are not ideal for typical long-running data analytics because of the time it takes to set up the virtual instances and spray the data out of S3 and into the virtual data nodes. And then you have to tear down everything after you are done with your MapReduce/HDFS instances to avoid paying big dollars for long running VMs. Not to mention you have to copy your data back out of HDFS and into S3 before your ephemeral data nodes are shut down - not ideal for any serious Big Data analytics.

Then there is the fact that Hadoop and MapReduce are batch oriented and thus not ideal for real-time analytics. So while we have taken many steps forward in technology evolution, the system administration challenges in managing large Hadoop clusters, for example, are still a problem, and cloud based Hadoop has many limitations and restrictions as already mentioned. In their current form, cloud based Hadoop solutions are too expensive for long running cluster processing and not ideal for long-term distributed data storage. Not to mention the fact that virtualization and Hadoop are not a great fit just yet given the current state of virtualization and public cloud hardware and software technology - this is a separate discussion.

The BigQuery Alternative

So if I want to build a serious enterprise scale Big Data Warehouse it sounds like I have to build it myself and manage it on my own. Now, enter into the picture Google BigQuery and Dremel. BigQuery is a serious game changer in a number of ways. First it truly pushes big data into the clouds and even more importantly it pushes the system administration of the cluster (basically a multi-tenant Google super cluster) into the clouds and leaves this type of admin work to people (like Google) that are very good at this sort of thing. Second it is truly multi-tenant from the ground up, so efficient utilization of system resources is greatly improved, something Hadoop is currently weak at.

Put your Data Warehouse in the Cloud

So now given all this, what if you could build your data warehouse and analytics engine in the clouds with BigQuery? BigQuery gives you massive data storage to house your data sets and a powerful SQL-like language called Dremel for building your analytics and reports. Think of BigQuery as one of your datamarts, where you can store both fast and slow changing dimensions of your data warehouse in BigQuery's cloud storage tables. Then using Dremel you can build near real-time and complex analytical queries and run all this against terabytes of data. And all of this is available to you without buying or managing any Big Data hardware clusters!

Modeling Your Data

In a classical Data Warehouse (DW), you organize your schema around a set of fact tables and dimension tables using some sort of snowflake schema or perhaps a simplified star schema. This is what is typically done for RDBMS based data warehouses. But for anyone who has worked with HDFS, HBase and other columnar or NoSQL data stores, this relational model of a DW no longer applies. Modeling a DW in a NoSQL or columnar data store requires a different approach. And this is what is needed when modeling your DW in BigQuery's data tables.

Slow Changing Dimensions

Slow Changing Dimensions (SCD) are straightforward to implement with a BigQuery data warehouse, since in a typical SCD model you are inserting new records each time into your DW. SCD models are common when you are creating periodic fixed point-in-time snapshots from your operational data stores. For example, quarterly sales data is always inserted into the DW tables with some kind of time stamp or date dimension. With a BigQuery data store you would put each record into its BigQuery table with a date/time stamp. So your ETL would look something like this:

Nothing special here with this ETL diagram other than that the data is moving from your enterprise to the Google Cloud. The output of the ETL is directed to BigQuery for storage in one or more BigQuery tables (note this can be staged via Google Cloud Storage). But keep in mind that when creating a Big Data Warehouse, you are typically storing your data in a NoSQL, columnar or HDFS type data store and thus you don't have a full RDBMS and all the related SQL join capability, so typically you must design your schemas to be much more denormalized than what is normally done in a DW. BigQuery, however, is a hybrid type data store, so it does allow for joins and provides rich aggregate functions. How you model the time dimension is of particular importance - more on this later. So your schema for a SCD table might look something like this:

Key(s)... | Columns... | EffectiveDate

The time dimension in this case is directly collapsed into what would normally be your fact table, and you would want, as much as possible, to denormalize the tables so your queries require minimal joins. As noted, Dremel allows for joins but requires that at least one of the tables in the join be "small", where small means less than 8MB of compressed data.
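To make the denormalization step concrete, here is a minimal Python sketch of an ETL transform that inlines a small dimension into each fact row and stamps the row with an EffectiveDate before loading. The table contents, column names and the `denormalize` helper are all hypothetical and only illustrate the shape of the Key(s) | Columns | EffectiveDate layout.

```python
from datetime import date

# Hypothetical small dimension table, keyed by region id.
REGIONS = {1: "EMEA", 2: "APAC"}

def denormalize(sales_rows, regions, effective_date):
    """Flatten each fact row: inline the region name and add EffectiveDate."""
    flattened = []
    for row in sales_rows:
        flattened.append({
            "OrderId": row["OrderId"],                               # key
            "Region": regions.get(row["RegionId"], "UNKNOWN"),       # inlined dimension
            "Amount": row["Amount"],                                 # fact column
            "EffectiveDate": effective_date.isoformat(),             # time dimension
        })
    return flattened

rows = denormalize(
    [
        {"OrderId": 10, "RegionId": 1, "Amount": 250.0},
        {"OrderId": 11, "RegionId": 3, "Amount": 99.5},  # region id not in dimension
    ],
    REGIONS,
    date(2012, 9, 30),
)
```

Resolving the dimension in the ETL means queries against the loaded table need no join for the region name.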

So now in Dremel's SQL language to select a specific record, for a particular point in time, you would simply perform a normal looking SQL statement such as this:

SELECT Column1 FROM MyTable WHERE EffectiveDate=DATE_OF_INTEREST

This query will select a record at a known date. With this approach, you can for example query for sales quarterly data where you know the records must exist for that particular date. But what if you want the most "current" record at any given point in time? This is actually something Dremel and BigQuery excel at, because it gives you SQL functionality, such as subselects, that are not typically found in NoSQL type storage engines. The query would look like this:

SELECT Column1 FROM MyTable WHERE EffectiveDate = (SELECT MAX(EffectiveDate) FROM MyTable WHERE EffectiveDate <= DATE_OF_INTEREST)

This query can sometimes be considered bad practice in a standard RDBMS (especially for very large tables), because of performance considerations of the subselect. However, with Dremel, this is not a problem given the way Dremel queries scale out and the fact that they do not rely on indexes.
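The two query shapes can be tried locally. The sketch below uses Python's built-in sqlite3 module purely as a stand-in for BigQuery, so it shows the query logic only, not BigQuery's dialect or performance. The table contents are made up, and the subselect uses MAX(EffectiveDate) to pick the latest effective date on or before the date of interest.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE MyTable (Id INTEGER, Column1 TEXT, EffectiveDate TEXT)")
conn.executemany(
    "INSERT INTO MyTable VALUES (?, ?, ?)",
    [
        (1, "Q1 snapshot", "2012-03-31"),
        (1, "Q2 snapshot", "2012-06-30"),
        (1, "Q3 snapshot", "2012-09-30"),
    ],
)

def record_at(conn, day):
    """Exact match: the record must exist for this date."""
    row = conn.execute(
        "SELECT Column1 FROM MyTable WHERE EffectiveDate = ?", (day,)
    ).fetchone()
    return row[0] if row else None

def record_as_of(conn, day):
    """Most current record on or before the given date (ISO dates sort as text)."""
    row = conn.execute(
        "SELECT Column1 FROM MyTable WHERE EffectiveDate = "
        "(SELECT MAX(EffectiveDate) FROM MyTable WHERE EffectiveDate <= ?)",
        (day,),
    ).fetchone()
    return row[0] if row else None
```

For example, `record_as_of(conn, "2012-07-15")` returns the Q2 snapshot, while `record_at(conn, "2012-07-15")` returns nothing because no record carries that exact date.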

Fast Changing Dimensions

Fast Changing Dimensions (FCD) require a bit more effort to create in a typical DW, and this is no different with BigQuery. In a FCD, you are often capturing frequent or near real-time changes from your operational data stores and moving the new data into your DW through your ETL. Your ETL engine must normally decide when to insert a new fact or time dimension record, and this often involves "terminating" the previously current record in the lineage of a record history set. But by leveraging the power of Dremel, FCD can be supported in BigQuery by just inserting a new record when the on-premises ETL engine detects a change, without terminating existing current records. And because you can perform the effective date based subselect noted above, there is no longer any reason to maintain both effective and termination date fields for each record. You only need the effective date.

This makes the FCD schema model, stored in BigQuery, identical to the SCD model for managing the time dimension. However, there is a catch. The ETL process must maintain a "Staging DW" of the records that exist on the BigQuery side. This Staging DW holds only the most current version of each record in your BigQuery table, which keeps it lean: it grows with the number of distinct records, not with the length of their history.
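The staging comparison can be sketched in a few lines of Python. This is a minimal illustration under assumed names: `staging` is an in-memory dict standing in for the Staging DW, `detect_changes` is a hypothetical helper, and a real ETL would persist the staging state and send the returned rows to BigQuery.

```python
def detect_changes(staging, incoming, effective_date):
    """Compare incoming records with staging; return only new or changed rows.

    staging:  dict mapping record key -> current column values (updated in place)
    incoming: dict mapping record key -> column values from the operational store
    """
    to_send = []
    for key, columns in incoming.items():
        if staging.get(key) != columns:       # new key or changed values
            staging[key] = dict(columns)      # staging keeps only the current version
            to_send.append({"Key": key, **columns, "EffectiveDate": effective_date})
    return to_send

staging = {}
first = detect_changes(
    staging, {"E1": {"Salary": 100}, "E2": {"Salary": 200}}, "2012-09-01"
)
second = detect_changes(
    staging, {"E1": {"Salary": 100}, "E2": {"Salary": 250}}, "2012-09-02"
)
```

The first run sends both records; the second sends only E2, whose salary changed. No termination date is written anywhere, since the effective date alone orders the history.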

So with this model your ETL will only send changes to the Google Cloud. This overall approach for FCD is useful for modeling ERP type data, for example, where records have effective and termination dates and where tracking changes is critical. Here is a diagram of the FCD ETL flow:

Note that for a FCD model that is not ERP centric (the data model does not depend on effective/termination date semantics), the Staging DW will not be required. This is typically the case when you are just blasting high volume, loosely structured data into BigQuery, such as log events or other timestamped action/event data. In this case, you don't have to detect changes and can just send the data to BigQuery for storage as it comes in.

Put your Data Warehouse in the Cloud

At Grand Logic we offer a powerful new way to build and augment your internal data warehouse with a BigQuery datamart in the Google cloud. Leveraging our real-time and batch capable ETL engines, we can move your fast or slow moving dimensional data into unlimited capacity BigQuery tables and allow you to run real-time SQL Dremel queries for rich reporting that will scale. And you can do all this with little upfront cost and infrastructure compared to managing your own HDFS and HBase cluster in Hadoop, for example.

With our flagship automation engine and ETL engine, JobServer, we can help you build a powerful data warehouse in the Google cloud with rich analytics with little upfront investment that will scale to massive levels. Pay as you go with full control over your data and your reporting.

Stay tuned to this blog for more details on how Grand Logic can help you build your Data Warehouse in the clouds. We will be discussing more details of our JobServer product and how our consulting services can get you going with BigQuery.

Contact us to learn how our JobServer product can help you scale your ETL and Data Warehousing into the cloud.

Tuesday, September 18, 2012

The Big Data Evolution Will Continue - No Kidding

Big Data is very much about discovering information locked in your mountains of data that come out of your production center, IT operations, enterprise systems, and back office databases. Information is all in the eye of the beholder so one person's junk is another person's gold. These days with the volumes of social data and device data growing at astronomical levels there is a lot of data to sift through and make sense out of.

While it is true that the more data you capture, the more information there is to discover, there is a limit to this. I think we are going through a cycle where capturing and trying to make sense out of vast volumes of data (social data, sensor data....etc) is becoming more economical and somewhat mainstream with respect to technology and tools. However, this is cyclical, I believe; at some point businesses will realize that maybe they are getting diminishing returns on all this data they are capturing and storing. For example, 20 years from now, will I really care what I tweeted 20 years ago? I probably will never have the time to go back and look at it, and I am not sure it is valuable to any marketing person (but who knows).

There is definitely gold to be mined in many data sets that now go untapped, and technologies like Hadoop, BigQuery and Storm, to name a few, are good tools to use, but not everything fits into the Big Data tent either.

There has been a lot of hype around Big Data these days, and I see a lot of people trying to shoehorn problems into Hadoop for no reason other than it being the cool thing to do, when the data crunching could be done in easier ways. However, the tool sets are expanding to give developers, scientists and business people more options when deciding how to store and analyze their data.

When thinking of Big Data, first ask yourself the following questions:

1) How much data do I want to capture and store (do I need to persist detailed records/data)?
2) How fast is this data being created (velocity)?
3) How long do I want to keep it (forever)?
4) How long am I willing to wait to get "information" when I run my analysis (batch/hourly/daily or real-time)?
5) What will it cost me to keep all this data around, and do I have the system admin muscle to do this?

This might help you determine in which of the particular emerging Big Data technology buckets your problem best fits and which approach to take (cloud cluster, on-premises cluster...etc).
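The first three questions reduce to simple arithmetic: volume is roughly velocity times record size times retention. Here is a back-of-the-envelope sketch in Python; the workload numbers are hypothetical, and it ignores compression, replication and indexing overhead.

```python
def estimate_storage_gb(records_per_second, bytes_per_record, retention_days):
    """Rough raw storage estimate in gigabytes (1 GB = 1e9 bytes)."""
    seconds = retention_days * 24 * 60 * 60
    return records_per_second * bytes_per_record * seconds / 1e9

# Hypothetical workload: 500 events per second, 200 bytes each, kept for one year.
yearly_gb = estimate_storage_gb(500, 200, 365)  # about 3,154 GB, a little over 3 TB
```

Running the same numbers with a retention of "forever" in mind makes the cost question in item 5 concrete: the raw volume grows by the same amount every year the data is kept.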

Sunday, July 8, 2012

Big Data Automation in the Cloud

Grand Logic is happy to announce expanded support for cloud analytics and big data automation services through our flagship product, JobServer. With JobServer, enterprises of all sizes, from startups to Fortune 100 companies can leverage the power of the cloud to tap the full potential of cloud based Big Data computing and analytics processing.

With solutions such as Amazon EMR and Google BigQuery growing in adoption and becoming economically advantageous, businesses now more than ever need to automate the flow of data between their enterprise storage systems and the cloud. Moving data and information between corporate intranets and the cloud is vital for efficient cloud based Big Data processing.

JobServer's point and click automation and scheduling tools are ideal for centrally managing the flow of data between your Big Data cloud systems such as Amazon EMR and Google BigQuery. JobServer can orchestrate the loading and retrieval of data between your Big Data processing systems in the cloud while tracking all your Big Data processing jobs, giving you one place to see everything that is happening in your Hadoop or BigQuery analytics processing.

In a typical deployment, JobServer sits on your corporate intranet and can load and move data from your in-house storage systems into the cloud for efficient processing, then track all Big Data job processing activity to return the critical data and results back in-house or to move them around in the cloud (for example, move data into and out of S3...etc). Alternatively, JobServer can be easily deployed on Amazon EC2 or Google Compute Engine instances and run in the cloud. There are multiple topologies possible based on your business operations.

JobServer comes with a built-in and open source plugin API that makes it easy to script and customize Amazon or Google web services APIs and create custom tasks and jobs using Java, web services, GWT and Python/Ruby/Bash scripts. For example, you can create complex map reduce jobs in JobServer, get notified when processing is completed, and be alerted of any issues at every stage of processing. JobServer also lets you schedule and track detailed real-time and historical reports on all job processing activities, whether you are running a Hadoop job, loading a table into the cloud, pulling data back out of BigQuery temp tables, or tracking the progress of BigQuery batch processing jobs.

JobServer gives you central control over any automation task you want to perform in the cloud or between activities happening in the cloud and your local enterprise storage and database systems. Try JobServer today and you will wonder how you operated without it.

About Grand Logic
Grand Logic delivers software solutions that automate business processes and tame your Big Data operations. Grand Logic delivers automation software and Hadoop consulting services that maximize your Big Data investment.