Sunday, December 25, 2016

The Death of Visual Analytics and the Dawn of Conversational BI

In the last several years we have seen the emergence of a new breed of business intelligence products that make it possible to build highly interactive, visually rich dashboards and reporting experiences. Products like Tableau, Domo, and Looker, to name a few, are replacing established BI heavyweights with a focus on self-service and rich visualizations.

What is driving this trend? Well, anyone not living under a rock for the last ten years will tell you that the explosion of data on the internet, coupled with advances in Big Data related technology, has made storing and accessing data much easier than ever before. But this alone is not the whole story.

Self Service BI is Good but not Good Enough

Products like Tableau have come onto the scene to lower the barrier for connecting to internet accessible data sources as well as to traditional sources locked up in relational databases and in the billions of Excel spreadsheets sitting around the enterprise world. Driven by this, Tableau, for one, has been successful for three primary reasons:
  1. Provides many out of the box data source connectors with an easy to use interface - connect to just about any data source.
  2. Self service analytics without some of the heavy lifting - you don't need an army of data and tech experts to model your data and meta-data.
  3. Highly compelling and visually rich analytics features - the visualizations you can create with Tableau are stunning - not always easy to do, but much more achievable than ever before.
So this is all great, but what does this have to do with the death of visual analytics? I seem to be saying that richer BI visualization is blossoming, inspired by tools like Tableau. Well, I will argue that item number three listed above is an evolutionary dead end and that we are going to see a gradual trend away from visually rich analytics.

There is such a thing as too much of a good thing. More visualization does not mean you are solving business problems more effectively, answering questions faster, finding root causes (answering why questions), or getting better predictions and trends. In fact, too much visualization might be overwhelming users.

A Stroll Through BI History

Let's take a quick ride back in time before we look forward. Human civilization has been evolving for thousands of years and our way out of the stone age was guided by the development of human language and communication. While it is true that a picture can say a thousand words, with the spoken or written word, on the other hand, one can express all of human existence in a short phrase, e.g "to be or not to be" or "I love you". Human expression through words is powerful - more powerful than any picture.

My point is that human communication is the most powerful expression and exchange of information. It is a fact that visualization is a powerful tool, but it pales in the presence of the written or spoken word. You can probably guess where I am going now.

Computers and computer to human interfaces have evolved over the past sixty or so years on a twisted evolutionary path. We started with simple command line tools and interfaces (mainframe), where we issued simple grunting commands and got back simple grunted responses from our computers. We then saw this lead to the evolution of rich graphical computer windows, icons and the mouse (point-drag-click). While this helped advance our interface and interaction with the computer and with extracting data from within these artificial devices, this path of human to computer interaction is effectively an evolutionary dead end. It pales in comparison with what is coming next.

Evolving Toward AI Conversations - More Than Pictures

Products like Tableau, Looker and others will need to evolve in the coming years or be left in the dustbin of technology. While we have seen amazing advancement in rich and interactive visualizations of data, I argue this is the wrong path and effectively an evolutionary dead end. How many times have you looked at Tableau dashboards (or other BI visualizations) and seen beautiful and rich colors, shapes and graphics, only to be overwhelmed by the information? What does this information mean, what does it tell me, what questions and answers are buried in this beautiful and rich visualization?
[Image: Tableau: Endangered Safari Animals]
What if instead of being bombarded by visualization alone, you could converse with the data - converse with the machine? Having rich visualizations can be fantastic, but I would want to ask the machine to answer questions about the visualization - make predictions or tell me "why" this has occurred - point me at the root cause. We are moving to a new dawn where machine learning and AI will help us make sense of the information around us that is currently locked up and visualized by computers. And this requires a new way (back to the future) for humans to interact with BI.

While computers started out as simple command line beasts, our current evolution toward more and more visualization is an evolutionary dead end. We will soon be moving toward a voice and messaging first world - where visualizations will augment our experience of information and are a tool for us to engage in conversation with our AI powered BI applications and virtual assistants. Chatbot BI virtual assistants are on the horizon.

More Than Just Looking Pretty - Answering Questions

You can see the beginning of this already. Tableau recently announced they will be releasing, in 2017, a new NLP interface to their platform - competitors will follow - and this is only the beginning. We will one day be able to ask questions of our BI in natural human language. The AI powered analytics revolution is coming. Conversational interfaces are a game changer for BI. Analytics as a conversation will no longer be the stuff of movies and sci-fi.

Driven by the advancement in AI and machine learning and with the massive surge in adoption in virtual assistants, chatbots and messaging/voice applications, the future will be here sooner than we think.

Tuesday, November 1, 2016

Thought Experiments are not Agile

Agile methodologies such as Scrum are all the rage these days. They have helped organizations achieve some perceived control over delivering projects on time and managing the overall scope of projects. In most cases when waterfall or agile processes fail, it is a failure of the organization in managing expectations and not necessarily of the processes that were used. I think everyone can agree on that.

This blog is not meant to be a bashing or endorsement of scrum, waterfall or any other project methodology. Personally, the approach I usually subscribe to is that you do what you say you are going to do and that your processes (whatever they are) are repeatable and are well understood by all in the organization. And from one release to the next you work to make your process better. End of story.

Now having said all that, let's get into what I really want to talk about. What kind of methodology inspires and nurtures the best and most innovative design work (software, UX or any other creative design) and ultimately leads to the best products and systems being built? No surprise that the answer has nothing to do with whether you are using agile or waterfall or with how much analysis and design documentation you generate. The answer instead has everything to do with tapping the human imagination.

The best designs and products don't come from a process of painting by numbers. You can break down a user story into as many tasks as you like (in your favorite scrum tool) and you can have "spikes" that lead to other user stories and build POCs until you are blue in the face, but this does not lead to creative or ground breaking results. What does? Well, if we look back at some of the greatest designs from the amazing works of such people as Archimedes, Galileo, Newton, and Einstein, these great minds had something in common. They conceived of many of their brilliant works with thought experiments.

I argue that great design can't be prescribed into design documents or crafted from user stories. The best and most innovative designs come through thought experiments. Then once you have the outline of a concept in your mind you can run off and start coding and crafting your ideas and attempt to let them take form. There is nothing like immersing yourself in deep thought over a problem. Few of us can claim to be at the level of brilliance of an Archimedes, but thought experiments are the path to brilliant work, whatever discipline you are in. I don't want to diminish the value of hard work and having good team discipline, but innovative designs are best done deep within mind experiments. Only in your mind can you think big while juggling all the variables and constraints, yet aim small on your target and balance all the parameters and dependencies that your problem and solution domain demand. No great design document or architectural blueprint will take shape without this; if it does, throw it away and start over.

So when you are sitting in your next sprint planning meeting, see if you can carve out a few user stories for some thought experiments. You might just deliver on some of your best work :)

Tuesday, October 25, 2016

Goodbye Apps and Hello Bots

The shift in the market is undeniable. Bots are beginning to challenge the established mobile app store ecosystem. There is plenty of evidence that mobile app adoption has plateaued and that the average user has lost their excitement for downloading and experimenting with new apps. There are more than 2 million apps in the Apple app store now! Ask the average mobile developer - it is almost impossible to get your app noticed or discovered in such a crowded space. Apps will always be with us, much like desktop applications and the company website, but there is a sea change.

Disruption and the New Players

It is becoming clear that jumping from one mobile app to the other is not a great experience for most users (especially for enterprise users) and this is giving messaging apps like Slack and Facebook Messenger the opportunity to become the new app/bot marketplace. GUI-less bots are easier for users to transition from and to, and they make it seamless to switch between bots and more natural to interact with an application service using human-like conversation (something people are already doing in droves on messaging apps). These bots are basically mini-apps with conversational interfaces. Slack (for the enterprise) and FB Messenger (for consumers) are both becoming the new application playground, and the promise of an AI enabled world is lurking within them to provide a user experience that traditional GUI apps are not capable of.

Microsoft is chasing Slack (using Skype) to establish itself in the enterprise team messaging market and in this new emerging bot marketplace. For Microsoft, this is obviously an opportunity to disrupt the mobile app market (where they have lost) and establish an early beachhead with bots, AI and enterprise team communication. Microsoft has clearly been leading the charge with products like Cortana, LUIS and their Bot Framework. All the other big players are in the bot and AI game as well, and the race is definitely on for who can deliver the best bot solution for developers. There is a new land grab in the making between the big tech giants, developers and startups.

How Do I Deliver My Bots?

I describe all this because to deliver conversational applications (aka bots) to end users, developers need a platform and a bot marketplace. Messaging apps will be that vehicle to supplant the traditional app store ecosystem, because building your own custom bot infused mobile app will not be the way to go for most developers in the future. Building a custom mobile app for your bot might still be possible in some situations where an app already has an established user base - like a banking app - but for the average bot developer messaging apps, like Slack, will be the delivery platform.

Messaging apps like Slack also offer a lot of out-of-the-box backend integration to help deal with single-sign-on, identity management, permissions, roles and executing custom business logic (via webhooks). Apps like Slack provide much of the platform plumbing for the backend integration that your bots will need, and enterprises are already adopting team messaging apps like Slack. This all lowers the barriers for connecting your bot to a company's cloud and back office systems, in order to get access to the necessary data and enterprise systems.

Messaging & Voice First Applications

I think the future model for developers will be to deliver their bots and AI conversational services through tools like Slack and possibly other popular "platform messaging apps" such as Cisco Spark, HipChat, Flowdock, FB Messenger, Skype, Kik, and others. All these messaging centric platform apps are already spreading fast through the corporate and consumer world. Developers will be leveraging these messaging platforms to deliver their AI services in the form of bots and conversational user interfaces. Mobile app stores will always be with us, but the game is changing. The new AI marketplace is happening now, get your bots ready!

Friday, October 7, 2016

Bots, AI and the Future of Augmented UX Design

We hear a lot these days about technologies such as futuristic looking VR goggles and mobile apps with augmented reality that enhance our interactions with the physical world using a computer generated reality that overlays and assists in our interpretation of the world around us. As computer users, we have become accustomed to rich visual interfaces as desktop, web and mobile app technologies have matured. However, the next leap forward in human-to-computer interaction will not be more visual effects, but in fact fewer, and we are seeing the beginnings of this shift in what we refer to today as "bots". This is only the beginning of a seismic shift in how we as end users interact with applications.

GUI Alone is Not Enough

Now, what if we could have the same augmented reality type experience applied to the countless GUI applications we all deal with on a daily basis, both at work and at home and from desktop to mobile? What do I mean? Computer applications are already computer generated, so why do they need an augmented reality? Yes, it is true that computer and mobile applications already live in the virtual real estate of the computer (or mobile device), so why would we need any augmented assistance while dealing with the application itself?

GUIs are Not Natural

If you reflect carefully on what is happening with bots and AI-type applications in general, we are seeing the creation of a new human to computer mode of interaction that can assist us in how we interact with the virtual world of the computer application. The dumb and boring old computer or mobile application screen is about to get a big dose of intelligence! Having an intelligent conversation with your application (not just mouse clicks and keyboard taps) will become the norm and is not just the stuff of science fiction. Note, this bot/AI augmented application does not necessarily have to converse by voice, have a personality or hold a deep philosophical discussion with us (maybe some day), but it will be able to assist us in our current world of applications beyond just the visual windows, buttons and menu options we have today.

We have become accustomed to interacting with our computers using an already mature human-to-computer model of clicking on buttons and visualizing our experience with a computer through drop down lists and dialog boxes, among other widgets and interfaces. But what if we could augment our interaction with a consumer application (e.g. a banking app) or an enterprise application (e.g. a supply chain application) with an intelligent chatbot that could aid us in the interaction with said application and the many knobs, controls and actions you can invoke on the application screen? This bot assistant could remember what we have done in the app in the past and guide us through taking actions using a combination of chat/message exchanges sprinkled in with intelligently timed suggested actions. This could in fact lead us to a situation where we do not need the full blown array of buttons and menus that bloat the apps we have today. We could have a conversation with the application with the help of an intelligent and conversational chatbot assistant.

The Shift to CUIs is Unstoppable and Inevitable

Well, this is where the world will soon be moving. With the ubiquity of mobile communication and advancements in machine learning, AI and big data, the scene is now set for every application we have become accustomed to using to have a chatbot assistant that can aid us in our interaction with the application itself. No more stupid dark ages style help documents to sift through. Think of it as online help docs on steroids, and this is just the beginning.

Any enterprise or consumer application team not starting to think about how they can replace their outdated online help docs and bloated UIs with more efficient and engaging intelligent and interactive chatbot assistance will be left in the dustbin of technology history. Don't worry, you have a few agile sprints before this happens :)

This transformation is not to be taken lightly, though. It will require a significant investment in engineering and technology and a big leap in thinking about how we design user experiences for end users and how we can expand the visual application metaphors we have grown accustomed to with new intelligent chat assistants that can guide us through the navigation of information and assist us in the potential actions that can be taken within an application.

Evolving your Development Processes for CUI

This technology leap will require a big shift in thinking from product owners, UX designers, information designers and engineers. This will require everyone in the product development ecosystem working together using enhanced business and engineering processes that put this new augmented UX design philosophy in the forefront while on the engineering side leveraging fast maturing technologies in the areas of NLP, AI, and machine learning to enable captivating and predictive conversational engagements between humans, their devices and their applications.

The applications of the future, whether on your desktop or on your mobile device, will in the coming years begin to manifest augmented UX capabilities. Be prepared for this new world whether you are a developer, UX designer or end user of these applications. Conversational interfaces are coming to an application near you to augment and enrich your application user experience!

Sunday, October 2, 2016

What's Next? Conversational Enterprise Applications

There is a lot of chatter these days (excuse the pun) around AI, machine learning, and chat bots and how this technology stack can be used to engage with users, at a human-like level, to exchange information and automate tasks. The elusive goal of an all intelligent AI machine that can be indistinguishable from a human and help us with day to day tasks has been with us since the Turing test and has been embedded in our psyche by countless sci-fi movies.

Bots Here and There and Everywhere

Today, that elusive goal is closer with the many advancements in computing and communication technology. We are starting to see the real application of such technology in tools like Slack where chat bots sit in the background ready to engage in channels/rooms to assist and respond to natural language chat communications to help solve/automate DevOps tasks or monitor and manage IoT infrastructure among many other applications. And we also see it with more casual consumer applications like Siri, Cortana and Alexa.

Where this is all headed is exciting both for consumer and enterprise applications. However, a lot of the current focus on how and where conversational interfaces can be applied is still stuck in the past. In my opinion there is too much attention given, for example, to a chatbot's personality and whether that chat bot is behaving with true human-like mannerisms. I think this distracts from the actual transformation that is happening and the opportunities that lie ahead for where and how conversational interfaces can transform business applications. A conversational interface does not necessarily need a human personality to be effective - keep that in mind when you go down the path of building this new form of user experience into your applications.

The Smart Command Line Interface

If we look back in time, we started first with "dumb" computer command line interfaces (i.e. the green screen CLI), then through the 80s, 90s, and early part of this century we went through a steady evolution toward more visually rich human to computer UX (think desktop apps, web X.0 and mobile). Interestingly enough, this has brought us full circle and back to the command line interface (CLI). But this new CLI is now "intelligent" and has the potential to take us to the promised land of conversational human to computer interaction. I won't get into the theory of why the intelligent CLI (aka the conversational interface) has reemerged and why it will prove to be more effective than our current bloated visual UX application world. And remember that the conversational interface can also be voice driven, but voice to text is more or less an added bonus and part of the longer technology maturity in this space.

What does the future hold? I propose that this next generation intelligent CLI should augment every business application in the coming years. Every enterprise should take a hard look at their current UI applications (business and consumer facing applications at every level) and make it a high priority to embed AI chat bot like intelligent CLIs (with or without a personality :) into every user experience and business function they have and for every application persona.

Don't Get Left Behind

Whether you are building an ERP application for an HR manager or a sales executive, or an IoT monitoring platform or an analytics dashboard, every one of these applications should have an intelligent conversational interface to augment the visual interface. By 2020 any enterprise not beginning to bring to the market such investments in conversational interfaces within their applications will be left in the dark ages of the visual only UX world.

Friday, September 23, 2016

Analytics + AI = Analytics as a Conversation

The pendulum is swinging in the business intelligence and analytics world. The ongoing technology evolution, driven in part by the adoption of Big Data, machine learning and other advancements in cloud computing, has made the storing, modeling and analyzing of huge volumes and velocities of data possible. The tools and IT skills needed to turn this data into rich visual information are now more accessible than ever before.

Products like Tableau, Splunk, Qlik, Birst, among others, have brought rich visualization and actionable-minded analytics (actionable analytics still not that common :) to the masses. It is now easier than ever to build rich visualizations, reports and dashboards. Building BI solutions to tackle all that data percolating around us, across social networks, IoT and within the enterprise, is now within reach of the IT masses, who can build compelling visual user experiences.

Visualization Overload

But there is trouble brewing on the horizon. Is there such a thing as too much data? Too much information? Too much visualization? I have built my share of BI and I have seen many amazing and compelling visualizations and dashboards using powerful solutions like Tableau and many home grown SaaS BI platforms. But I think it is time to step out of the forest and look at how humans effectively interact with information.

While we rely heavily on our visual sense, even the most well-intentioned and minimalistic BI dashboard (and its supporting drill-down reports) might not always be the best way of getting to the information you want or need. Humans have another ability for consuming information: the conversation (question and answer).

There are many technologies now converging and making it possible for us to evolve our BI stack beyond purely visualization based analytics. Analytics-as-a-Conversation (A3C) in my mind is the next frontier for BI. It does not necessarily replace today's rich visualization based BI, but augments it.

Living in the Matrix

What is A3C? Well, in movie terms, it is sort of the Matrix. It is about having a conversation with your BI and getting at what you need (the what) through normal human-like conversation (think texting, hashtags, tweets and even emojis). Also, this conversational form of BI is a much more natural way of interacting with complex information and can more naturally lead to asking not just the "what" questions but the "why" questions to your BI Matrix. And this form of information interrogation lends itself to setting a more clear context to the information exchange, as the BI conversation progresses from one question-answer to the next question-answer. For example, perhaps you ask your A3C system the value of a particular KPI or which KPI is the most off its norm this quarter. And then this can naturally lead to such questions as to "why is this KPI higher this quarter?"

Obviously we are not Neo and we are not talking to the Matrix, so the system has to be taught (or programmed to learn) how to converse with a human-like grammar and has to be programmed to extract what it needs from the grammar/questions using NLP and then translate that into queries against the target data and metadata system. There would have to be bounds on the grammar and enough knowledge of the system's metadata to compose the proper answers. No small engineering effort, to say the least, but from where we are today with AI, bots, machine learning, NLP and general computing stacks, the technology is there to accomplish this.
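To make the idea of a bounded grammar concrete, here is a minimal sketch in plain Python. Everything in it is an assumption made for illustration: the KPI names, the `KPI_VALUES` store standing in for the data and metadata system, and the two regular-expression question shapes. A real system would use a proper NLP stack; note too that the "why" answer here only reports the change versus the prior quarter, which is context for a root cause rather than a root cause.

```python
import re

# Hypothetical KPI store standing in for the BI system's data and metadata.
KPI_VALUES = {
    ("revenue", "q1"): 120.0,
    ("revenue", "q2"): 150.0,
    ("churn", "q1"): 0.05,
    ("churn", "q2"): 0.04,
}

# A deliberately bounded grammar: each pattern maps one question shape to an intent.
PATTERNS = [
    (re.compile(r"what (?:is|was) (?P<kpi>\w+) in (?P<period>q[1-4])\??$"), "value"),
    (re.compile(r"why is (?P<kpi>\w+) (?:higher|lower) in (?P<period>q[1-4])\??$"), "change"),
]


def parse(question):
    """Return (intent, slots) for a supported question, or None if out of grammar."""
    text = question.strip().lower()
    for pattern, intent in PATTERNS:
        match = pattern.match(text)
        if match:
            return intent, match.groupdict()
    return None


def answer(question):
    """Translate a parsed question into a lookup against the KPI store."""
    parsed = parse(question)
    if parsed is None:
        return "Sorry, I can only answer questions about known KPIs and quarters."
    intent, slots = parsed
    kpi, period = slots["kpi"], slots["period"]
    if (kpi, period) not in KPI_VALUES:
        return f"I have no data for {kpi} in {period}."
    if intent == "value":
        return f"{kpi} in {period} was {KPI_VALUES[(kpi, period)]}"
    # "change" intent: compare against the previous quarter to give the "why" some context.
    previous = f"q{int(period[1]) - 1}"
    if (kpi, previous) not in KPI_VALUES:
        return f"I have no earlier quarter to compare {kpi} against."
    delta = KPI_VALUES[(kpi, period)] - KPI_VALUES[(kpi, previous)]
    return f"{kpi} moved by {delta:+.2f} versus {previous}"
```

Asking `answer("What was revenue in Q2?")` and then `answer("Why is revenue higher in Q2?")` shows how one question-answer pair sets the context for the next, while anything outside the grammar gets a polite refusal.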

All the Right Ingredients Coming Together for AI

Why now? Because the technologies needed to construct the BI Matrix I am describing are largely here and the data volumes are now, in my mind, overwhelming even for the best BI visualizations. With a bit of creativity (and sweat), and with current availability and advancements in Machine Learning, AI and general computing power, it is possible today to begin to build such intelligent conversational analytics systems and user experiences. Don't forget this is about changing how the user "experiences" data.

It is not just about data volumes and technology capabilities; human interaction has itself evolved in the past decade. We have seen with the recent explosion of mobile and social communication that humans are using texting and short messages for communication more than ever, with no sign of ebbing. In fact, texting is quickly becoming the dominant form of communication and the main form of information exchange across the globe and across all demographics.

How is this better than the visualization based BI we have today? Well, I would say it is not necessarily a replacement for the BI we have today, but is instead complementary and can lead to BI answering questions of "what" and "why" that the original BI developer/modeler could not necessarily anticipate out of the box. And as artificial intelligence and machine learning systems continue to evolve and improve the potential is virtually limitless and no longer bounded by what can be rendered on a 2D display or a click of the mouse.

Back to the Future

The revenge of the CLI (the command line interface) is upon us :) But don't underestimate the conversational CLI, it will prove to be orders of magnitude more powerful than any visualization a human can conjure up.

Stay tuned, Analytics-as-a-Conversation is coming and we will all be talking about it (or talking with it).

Tuesday, August 30, 2016

Getting Bitemporal With Your OLTP Database

The concept of a bitemporal database can seem a bit too exotic and complex to be considered for a typical RDBMS schema model. While the transaction processing and query structures required to make this happen with a standard RDBMS are more involved than in a normal database model, annotating every table in your RDBMS model with bitemporal semantics is a fairly straightforward design methodology.

See the table below for an example of what the table structure might look like. The TT start/end columns are the transaction time dimension and the VT start/end columns are the validity time dimension. These four columns drive the basic schema model structure for a bitemporal database and enable powerful queries that can pivot and scan for data across two time dimensions without the need for a data warehouse or other complex analytics.

The table above looks straightforward, right? And it is. The bit of complexity comes with how to handle the actual data mutations (a change to a row). Every row that is superseded must be replaced by a new TT and VT tuple with proper time semantics, and this must be handled in a transactionally consistent fashion to ensure a continuous flow of tuple epochs (an epoch is a row at a particular TT/VT point in time), where the new epoch that supersedes the prior epochs properly terminates their TT and VT periods at the start of the new TT and VT periods.
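One way this mutation handling might look is sketched below using Python's built-in `sqlite3` module. The `price` table, its column names, the `FOREVER` sentinel for open-ended periods and the `set_price` helper are all assumptions made for illustration, not a prescribed design. Dates are ISO strings so that plain string comparison orders them correctly.

```python
import sqlite3

FOREVER = "9999-12-31"  # assumed sentinel meaning "period still open"


def create_schema(conn):
    """Create a hypothetical table carrying the four bitemporal columns."""
    conn.execute(
        """
        CREATE TABLE price (
            item     TEXT NOT NULL,
            amount   REAL NOT NULL,
            tt_start TEXT NOT NULL,  -- transaction time: when the row was recorded
            tt_end   TEXT NOT NULL,  -- transaction time: when the row was superseded
            vt_start TEXT NOT NULL,  -- validity time: when the fact starts to hold
            vt_end   TEXT NOT NULL   -- validity time: when the fact stops holding
        )
        """
    )


def set_price(conn, item, amount, valid_from, recorded_at):
    """Record a new price without overwriting any previously stored value."""
    insert = (
        "INSERT INTO price (item, amount, tt_start, tt_end, vt_start, vt_end) "
        "VALUES (?, ?, ?, ?, ?, ?)"
    )
    with conn:  # one transaction: commit on success, roll back on error
        current = conn.execute(
            "SELECT rowid, amount, vt_start, vt_end FROM price "
            "WHERE item = ? AND tt_end = ? AND vt_start <= ? AND vt_end > ?",
            (item, FOREVER, valid_from, valid_from),
        ).fetchone()
        new_vt_end = FOREVER
        if current is not None:
            rowid, old_amount, old_vt_start, old_vt_end = current
            new_vt_end = old_vt_end
            # Terminate the superseded epoch in transaction time.
            conn.execute(
                "UPDATE price SET tt_end = ? WHERE rowid = ?", (recorded_at, rowid)
            )
            # Re-assert the old value for the part of its validity that still stands.
            if old_vt_start < valid_from:
                conn.execute(
                    insert,
                    (item, old_amount, recorded_at, FOREVER, old_vt_start, valid_from),
                )
        # The new epoch starts exactly where the prior one was terminated.
        conn.execute(
            insert, (item, amount, recorded_at, FOREVER, valid_from, new_vt_end)
        )
```

Only the `tt_end` marker of a superseded row is ever touched; the recorded values themselves stay as they were, and wrapping the three statements in a single transaction keeps the chain of epochs continuous.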

The advantages of a bitemporal schema model are many. They include:
  1. Immutable data structures which means all tuples preserve all changes across time.
  2. Built-in audit trail functionality, since no changes are ever overwritten.
  3. The ability to write fairly simple queries to view data at any point in time.
  4. Easily compare any two points of time for changes.
  5. The ability to find all changes across a time range.
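The point-in-time and change-range queries from the list above can be sketched as follows, again with Python's built-in `sqlite3`. The `price` table, the sample rows and the helper names `load_demo`, `as_of` and `changes_between` are assumptions made for illustration only.

```python
import sqlite3

FOREVER = "9999-12-31"  # assumed sentinel meaning "period still open"

# Hypothetical history: a price of 10.0 recorded on Jan 1, then on May 15
# a change to 12.0 that takes effect on Jun 1.
DEMO_ROWS = [
    ("widget", 10.0, "2016-01-01", "2016-05-15", "2016-01-01", FOREVER),
    ("widget", 10.0, "2016-05-15", FOREVER, "2016-01-01", "2016-06-01"),
    ("widget", 12.0, "2016-05-15", FOREVER, "2016-06-01", FOREVER),
]


def load_demo():
    """Build an in-memory database holding the sample bitemporal rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE price (item TEXT, amount REAL, "
        "tt_start TEXT, tt_end TEXT, vt_start TEXT, vt_end TEXT)"
    )
    conn.executemany("INSERT INTO price VALUES (?, ?, ?, ?, ?, ?)", DEMO_ROWS)
    conn.commit()
    return conn


def as_of(conn, item, valid_at, known_at):
    """What did we believe at known_at about the value in effect at valid_at?"""
    row = conn.execute(
        "SELECT amount FROM price WHERE item = ? "
        "AND vt_start <= ? AND ? < vt_end "
        "AND tt_start <= ? AND ? < tt_end",
        (item, valid_at, valid_at, known_at, known_at),
    ).fetchone()
    return None if row is None else row[0]


def changes_between(conn, item, tt_from, tt_to):
    """All epochs recorded in the half-open transaction time range [tt_from, tt_to)."""
    return conn.execute(
        "SELECT amount, vt_start, vt_end, tt_start FROM price "
        "WHERE item = ? AND tt_start >= ? AND tt_start < ? "
        "ORDER BY tt_start, vt_start",
        (item, tt_from, tt_to),
    ).fetchall()
```

Comparing two points in time is then just two calls to `as_of` with different `known_at` or `valid_at` arguments, which is the kind of simple query the two time dimensions make possible.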

Injecting bitemporal capabilities into your schema will allow tracking every change that happens within a table across two time dimensions: transaction time (when the mutation happened) and validity time (the time range over which the mutation and current state of the row are valid).

Some databases such as Oracle, DB2 and PostgreSQL have specialized extensions to support bitemporal capabilities, but you don't really need these extensions - they only help with the DDL aspect of the design and not with the DML or query aspect. For the most part, these extensions are just syntactic sugar that you can implement on your own in a more cross database fashion and even extend to support NoSQL databases as well.

Get started with turning your schema model into a bitemporal powered RDBMS. Contact Grand Logic to learn how we can help you build your next bitemporal database environment.

Thursday, May 26, 2016

Building ML Pipelines

What is involved in building a machine learning pipeline? Here is a common flow:

  • Data pre-processing
  • Feature extraction
  • Model fitting
  • Validation stages
Learn more about ML pipelines (from a Spark perspective).
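The four stages can be sketched end to end in a few lines of plain Python. This is not Spark's pipeline API; the toy data, the single "count the exclamation marks" feature and the threshold model are all assumptions chosen to keep the flow visible.

```python
from statistics import mean


def preprocess(texts):
    """Data pre-processing: trim whitespace and normalize case."""
    return [text.strip().lower() for text in texts]


def extract_features(texts):
    """Feature extraction: one toy feature, the number of exclamation marks."""
    return [float(text.count("!")) for text in texts]


def fit(features, labels):
    """Model fitting: a threshold halfway between the two class means."""
    positives = [f for f, y in zip(features, labels) if y == 1]
    negatives = [f for f, y in zip(features, labels) if y == 0]
    return (mean(positives) + mean(negatives)) / 2


def predict(threshold, features):
    """Label an example 1 when its feature exceeds the learned threshold."""
    return [1 if f > threshold else 0 for f in features]


def validate(threshold, features, labels):
    """Validation: accuracy on examples held out from fitting."""
    predictions = predict(threshold, features)
    correct = sum(1 for p, y in zip(predictions, labels) if p == y)
    return correct / len(labels)


def run_pipeline(train, holdout):
    """Chain the stages: pre-process, extract features, fit, then validate."""
    train_texts, train_labels = zip(*train)
    holdout_texts, holdout_labels = zip(*holdout)
    threshold = fit(extract_features(preprocess(train_texts)), train_labels)
    accuracy = validate(
        threshold, extract_features(preprocess(holdout_texts)), holdout_labels
    )
    return threshold, accuracy


# Hypothetical labeled messages: 1 = promotional, 0 = ordinary.
TRAIN = [("Buy now!!!", 1), ("Hello there", 0), ("WIN!! WIN!!", 1), ("  See you at 5  ", 0)]
HOLDOUT = [("Free!!!", 1), ("Lunch?", 0)]
```

Calling `run_pipeline(TRAIN, HOLDOUT)` returns the learned threshold and the holdout accuracy. The point is the shape: each stage consumes the previous stage's output, and the same pre-processing and feature steps are reused for validation.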

Wednesday, May 25, 2016

Tuesday, May 24, 2016

Spark Structured Streaming - Crazy Like a Fox

Big Data related computing has matured greatly over the past several years from its early and humble Map-Reduce days. Hadoop introduced developers and enterprises to mainstream distributed computing on commodity hardware and with a software stack (largely Java based) that was accessible to the average developer.

Early versions of Hadoop did not have the most developer friendly APIs, but they made breaking up large computing tasks and iterative processing possible to scale without big iron and SMP hardware. Things evolved and improved with the emergence of memory efficient Big Data engines such as Apache Spark. This has also been helped by the fact that memory prices keep dropping.

A lot of attention has been given to Apache Spark these days as the successor to Hadoop. The big advantage that Spark is touted to have over Hadoop is how its Map-Reduce engine leverages distributed memory to improve performance over classic Hadoop. While this is true, the broader Hadoop ecosystem has been evolving rapidly as well, so this alone is not at the heart of what has given Apache Spark such a big leap forward.

What is often underestimated in the growing popularity of Spark is its API. If you have ever tried to write a Map-Reduce type job in Java on Hadoop 1.x or 2.x you will understand. Spark is polyglot, with API support for Scala, Java, Python and R. The way you build data processing pipelines and construct transformations and aggregations in Spark is well thought out by the authors of Spark.

Spark is not standing still either. With the development of Spark Streaming, Spark SQL, DataFrames and Datasets in the Spark API, Spark is making the effort of manipulating data and writing processing logic much more intuitive for developers. The elegance of the Spark API is a key part of the reason why Spark has grown in popularity.

One knock on Spark is that it is now being obsoleted by the next wave of compute fabric engines that are built from the ground up to be realtime streaming centric. Many claim that this streaming-first architecture is superior to Spark's batch-based architecture for both general purpose processing and especially for streaming operations. Products such as Storm, Flink and Apex, just to mention a few, have garnered a lot of attention. The claim is that by using a streaming-first architecture, these engines can do both batch processing and streaming more efficiently than Spark does batch and micro-batch based streaming.

What is often left out of such debates is again the API. If you have ever tried to write a Storm processing stream you will know what I mean. So again, this is where Spark shines with its more intuitive APIs.

Now this is where we get to the new API coming in the soon-to-be-released Apache Spark 2.0. Spark will be introducing a new Structured Streaming API that will unify streaming, batch and Spark SQL. The Spark team is raising the productivity bar by unifying how developers build both batch and streaming applications.

The idea is that a streaming application is really a "continuous application" and that the best way to build a streaming application is not to reason about streaming at all. In other words, Spark 2.0 with Structured Streaming will make building streaming applications no different than building any other Spark application. The streaming aspect is essentially declarative and the Spark engine will do the work of optimizing the stream. The big advantage this has for developers is that we can continue to think of our applications in the same way whether they are doing streaming or batch.
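
To illustrate the idea (and only the idea), here is a small plain-Python sketch. It is not the Structured Streaming API; `count_by_device`, `run_batch` and `run_streaming` are made-up names. The "query" is written once, and a hypothetical engine decides whether to feed it a whole table or a series of small increments.

```python
from collections import Counter
from typing import Iterable, Iterator, List


def count_by_device(events: Iterable[dict], state: Counter) -> Counter:
    """The 'query': count events per device. It is written once and knows
    nothing about whether the input is a finite table or an unbounded stream."""
    for event in events:
        state[event["device"]] += 1
    return state


def run_batch(table: List[dict]) -> Counter:
    """Batch mode: the engine feeds the whole table in one pass."""
    return count_by_device(table, Counter())


def run_streaming(micro_batches: Iterator[List[dict]]) -> Iterator[Counter]:
    """Streaming mode: the engine feeds new rows as they arrive, keeps the
    running state, and yields an updated result after every increment."""
    state = Counter()
    for chunk in micro_batches:
        yield Counter(count_by_device(chunk, state))  # snapshot of the result so far


events = [{"device": d} for d in ["a", "b", "a", "c", "a", "b"]]
batch_result = run_batch(events)

chunks = (events[i:i + 2] for i in range(0, len(events), 2))
stream_results = list(run_streaming(chunks))

# Once the stream has seen every event, its answer matches the batch answer.
print(batch_result == stream_results[-1])
```

The developer only ever touches `count_by_device`; how the rows are delivered is the engine's problem, which is the productivity argument being made here.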

Spark 2.0, with the advent of Structured Streaming, will leapfrog Spark ahead of the competing streaming-first engines by removing the stream design complexity while at the same time bringing the elegance of Spark's APIs to the forefront.

At the end of the day, Spark's well designed APIs will prove to be pivotal for developers. Spark's fast evolving optimized engine (Tungsten...etc) will offer a hard to beat combination of developer productivity and raw scalable performance. The idea of having a programming model that does not require developers to reason about a stream, and instead lets them focus on the higher order functions of their application, will in the end prove superior to the harder to use streaming-first engines such as Storm and the like. This unified programming model also frees the Spark engine to evolve the low-level streaming plumbing over time without impacting developers.

Wednesday, May 18, 2016

Fluent Interfaces a Beautiful Thing

Fluent programming interfaces, when done right, are an elegant thing to behold (for a programmer). They require no specialized learning versus what it would take to build and model the same sort of domain logic in an external DSL. While specialized DSLs have their place, they create a challenging ecosystem to support and impose the need for additional moving parts outside the core development of the application and system. When dedicated long-term resources are applied to supporting a DSL, there is no doubt external DSLs can be a powerful thing. But in the absence of this, fluent interfaces are a powerful software programming pattern.

Here is a good video presentation describing the pros and cons of fluent interfaces vs using external DSLs. The presentation provides a pragmatic perspective from a point of personal experience in the industry.

Like anything, fluent interfaces can be abused, but when used with good intentions they can create software that is easier to build, read and maintain. What are good examples of fluent interfaces? There are many, and I have noticed more frameworks and APIs supporting them. Cassandra's Java driver is one example (QueryBuilder), and frameworks like Apache Spark and other general map/reduce data flow processing APIs make great use of fluent interfaces.

Here is a snippet of code I borrowed from Martin Fowler's post on the subject that gives a before and after example of using a fluent API:

private void makeNormal(Customer customer) {
        Order o1 = new Order();
        customer.addOrder(o1);
        OrderLine line1 = new OrderLine(6, Product.find("TAL"));
        o1.addLine(line1);
        OrderLine line2 = new OrderLine(5, Product.find("HPK"));
        o1.addLine(line2);
        OrderLine line3 = new OrderLine(3, Product.find("LGV"));
        o1.addLine(line3);
        line2.setSkippable(true);
        o1.setRush(true);
}

private void makeFluent(Customer customer) {
        customer.newOrder()
                .with(6, "TAL")
                .with(5, "HPK").skippable()
                .with(3, "LGV")
                .priorityRush();
}
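
The mechanics behind a chained call like the one above are simple: each method changes the builder and then returns it, so the next call has something to hang off. Here is a minimal sketch of that, written in Python for brevity. The `OrderBuilder` class and its method names are hypothetical and are not taken from Fowler's post (`with` is a reserved word in Python, hence `with_line`).

```python
class OrderBuilder:
    """Hypothetical fluent builder; every method returns self so calls chain."""

    def __init__(self, customer):
        self.customer = customer
        self.lines = []
        self.rush = False

    def with_line(self, quantity, product_code):
        self.lines.append({"qty": quantity, "product": product_code, "skippable": False})
        return self

    def skippable(self):
        # Applies to the most recently added line, mirroring the chained example.
        self.lines[-1]["skippable"] = True
        return self

    def priority_rush(self):
        self.rush = True
        return self


order = (
    OrderBuilder("ACME")
    .with_line(6, "TAL")
    .with_line(5, "HPK").skippable()
    .with_line(3, "LGV")
    .priority_rush()
)
print(order.lines, order.rush)
```

Note the design trade-off: `skippable()` only makes sense right after a line has been added, so the reading order of the chain carries meaning that a plain setter would make explicit.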

So, while fluent interfaces don't give you the power of a full fledged external DSL, they can be a productive boost to any API you are building. Give fluent interfaces a look in your next framework; they can make your code easier to build and maintain.

Saturday, April 23, 2016

Visualizing the Data Science Disciplines

Nice visualization showing how the various data science disciplines interrelate. Puts some of the hype around artificial intelligence, predictive analytics and big data in some perspective.

Thursday, April 14, 2016

Understanding Supervised vs Unsupervised Machine Learning

I always found it a bit difficult to explain how labeled and non-labeled data sets factored into machine learning algorithms and the related training/modeling process. This short explanation I found on Stack Overflow helped crystallize it for me:

I have always found the distinction between unsupervised and supervised learning to be arbitrary and a little confusing. There is no real distinction between the two cases, instead there is a range of situations in which an algorithm can have more or less 'supervision'. The existence of semi-supervised learning is an obvious example where the line is blurred.

I tend to think of supervision as giving feedback to the algorithm about what solutions should be preferred. For a traditional supervised setting, such as spam detection, you tell the algorithm "don't make any mistakes on the training set"; for a traditional unsupervised setting, such as clustering, you tell the algorithm "points that are close to each other should be in the same cluster". It just so happens that, the first form of feedback is a lot more specific than the latter.

In short, when someone says 'supervised', think classification, when they say 'unsupervised' think clustering and try not to worry too much about it beyond that.
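
To see the difference in the "feedback" each approach gets, here is a toy sketch in plain Python on one-dimensional data. The data points, the "ham"/"spam" labels and both functions are made up for illustration. The supervised version is told which group every training point belongs to; the unsupervised version (a bare-bones k-means) only knows that nearby points should end up together.

```python
from statistics import mean

points = [1.0, 1.2, 0.8, 5.0, 5.3, 4.9]


def fit_supervised(xs, labels):
    """Supervised: labels say which group each training point belongs to,
    so the model is just the mean of each labeled group."""
    groups = {}
    for x, label in zip(xs, labels):
        groups.setdefault(label, []).append(x)
    return {label: mean(members) for label, members in groups.items()}


def classify(centroids, x):
    """Assign x to the label whose centroid is closest."""
    return min(centroids, key=lambda label: abs(centroids[label] - x))


def fit_unsupervised(xs, k=2, rounds=10):
    """Unsupervised: no labels, only the hint that nearby points belong
    together. This is a bare-bones 1-D k-means."""
    centers = sorted(xs)[:: max(1, len(xs) // k)][:k]  # crude spread-out start
    for _ in range(rounds):
        clusters = [[] for _ in centers]
        for x in xs:
            nearest = min(range(len(centers)), key=lambda i: abs(centers[i] - x))
            clusters[nearest].append(x)
        centers = [mean(c) if c else centers[i] for i, c in enumerate(clusters)]
    return centers


labeled_model = fit_supervised(points, ["ham", "ham", "ham", "spam", "spam", "spam"])
print(classify(labeled_model, 4.5))   # the answer is a label we supplied

print(fit_unsupervised(points))       # the answer is cluster centers with no names
```

Both end up finding the same two groups here; the difference is that only the supervised model can tell you what the groups mean.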

Hope you find it useful :)

Wednesday, April 6, 2016

The Industrial IoT and the Rise of Cloud Machine Learning

The Internet of Things (IoT) has been driven by advancements in many areas of technology along with the ever expanding reach of the internet. This has made it feasible today for any device big or small to be connected to the world.

Alone, having billions of devices sharing information is more or less noise. The IoT is of little value unless businesses and industries can turn raw contextual data into valuable and actionable information. We are at a turning point across all industries where the volumes of data being generated have the potential to be turned into vital business information.

The Killer App
The IoT is founded on two principles. First and foremost is the ability to efficiently collect and catalog, in a timely manner, the vast sets of data available from sensors and other internal digital systems. Second, it is about creating machine learning models and the related analytics that can drive predictive and prescriptive decision making for the owners of these devices and industries. With the ability to connect any device big or small, data gathering and intelligent decision making based on large volumes of timely data will propel many industrial killer applications that turn data into value and optimize business functions and operations. The opportunities for an industry or business to create its own IoT killer apps are just beginning - there are countless opportunities across all markets. Using data gathered from every corner of your business will drive huge opportunities for business optimization and efficiency.

Much of the initial interest in the IoT started around consumer and retail scenarios, such as tracking and monitoring consumers or optimizing product movement in the supply chain. While this is all good, we are now moving beyond the consumer aspects. The IoT is now invading the world of the Industrial Internet, where the potential business efficiencies dwarf the opportunities in the consumer and retail operations space and can ultimately offer profound benefits to human advancement.

Get Your ETL Groove
Predicting failures and optimizing maintenance/operations have tremendous value across all industries from healthcare to aviation, but this is the end result of an overall process. Some of the less glamorous aspects that ultimately enable machine learning and predictive analytics lie in the challenges of collecting the wealth of data that the downstream machine learning and analytics are based on. Whether you are a wind turbine power plant or a railway operator, collecting data from field operations is no small task. You cannot get to useful machine learning models without gathering the data needed to train and feed your models. This is often an obstacle for industries not accustomed to collecting data using a Big Data mindset (velocity/volume/variety and context), but it is a critical first step to be conquered.

Leveraging the Cloud
Here is where cloud services and machine learning PaaS solutions can help move industries from design to live deployments. Moving data into the cloud for cleansing and ETL processing is the first step to prepare your data sets for consumption by your data wranglers and data scientists. The good news is there are many startups popping up helping with this end to end process. Machine learning services are a new market for startups, and it is definitely worth looking into leveraging such services if you don't have the capacity to build the data wrangling, machine learning and predictive analytics yourself.

Leveraging the cloud is a great option for many businesses, and there are already many options to choose from. You can build your own from the ground up on an IaaS cloud environment or leverage the growing list of small and big PaaS cloud service providers coming on line. One interesting trend is the focus on domain specific machine learning and analytics for the IoT. Companies such as Predikto, for example, focus on predictive analytical services for targeted vertical industries (rail and aviation in this case). I think we will see an increase in startups focusing on abstracting away the technology complexity and plumbing and offering end users end-to-end services geared toward a particular market and industry. This focus on vertical domains also aligns well with how machine learning models need to be tuned and optimized and how each vertical industry tailors its own predictive and prescriptive decision making.

Take Some Action
As we move past the first phase of the IoT where it has been about tracking and monitoring the many connected devices, in the next phase industries will be wanting to make actionable sense out of their data in ways that can improve business efficiencies and replace slow reactive human decision making with real-time decision making based on machine learning models. The ultimate goal is to reach a point where the decision making is prescriptive and even potentially AI powered, but we are a long way from that. For many businesses, it is about wrangling their data and enabling the "Things" they are tapping into to be able to feed to the cloud in order to build and drive their machine learning models and downstream analytics.

The comparison I like to make is that the way the IoT is transforming business, across all industries, is similar to what happened in financial marketplaces with the advent of digital trading platforms and high frequency trading systems. The financial trading platforms of today collect and monitor vast numbers of data points, and everyone is looking for the most timely and actionable information in order to beat the next guy. This is what is happening with the IoT, in large part. It is bringing to the surface vast amounts of data from every corner of a business and making it actionable. However, it will take time for the many industries, from farming to manufacturing, to get their machine learning bearings. Again, don't look to build all of it yourself. There are many cloud and big data resources and even full service startups specializing in your industry that can help.

Updating Machine Learning Models
The process of making predictions and anticipating the future is based on building accurate models of your industrial world. These models are often not static. Models change over time as the many variables that impact your business change and as your business itself grows and evolves. So a common consideration in machine learning is how to keep evolving your models. A good post on this subject can be found here. This article describes how time-series prediction (where historical data is vital to the model, as in weather modeling) and feedback data (data that shifts as a business grows, such as a retail store starting to sell new products) impact machine learning models and can trigger models to be retrained from scratch. Understanding how your models evolve is an important aspect of machine learning, because without accurate models you have bad information. You are only as good as your models, and keeping them up to date is a constant effort by your data wranglers and data scientists.
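
One simple way to decide when a model is due for retraining is to watch its recent prediction error and compare it to the error it had when it was first deployed. The sketch below is a plain-Python illustration of that idea; the `DriftMonitor` class, the window size and the tolerance factor are assumptions made up for this example, not recommendations.

```python
from collections import deque
from statistics import mean


class DriftMonitor:
    """Tracks recent prediction error and flags when a retrain looks due.
    The window size and tolerance are illustrative values only."""

    def __init__(self, baseline_error, window=5, tolerance=1.5):
        self.baseline_error = baseline_error
        self.tolerance = tolerance
        self.recent = deque(maxlen=window)   # only the latest `window` errors are kept

    def observe(self, predicted, actual):
        self.recent.append(abs(predicted - actual))

    def needs_retrain(self):
        window_full = len(self.recent) == self.recent.maxlen
        return window_full and mean(self.recent) > self.tolerance * self.baseline_error


monitor = DriftMonitor(baseline_error=2.0)

# Early on, predictions track reality closely.
for predicted, actual in [(100, 101), (98, 99), (102, 100), (97, 98), (101, 100)]:
    monitor.observe(predicted, actual)
stable = monitor.needs_retrain()

# The business changes (say, new products go on sale) and errors grow.
for predicted, actual in [(100, 108), (99, 109), (101, 112), (98, 110), (100, 111)]:
    monitor.observe(predicted, actual)
drifted = monitor.needs_retrain()

print(stable, drifted)
```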

It is also important to appreciate that observing your business can change it as well. So you must always be looking at retraining your models as you use predictions that optimize your business. The process of optimizing your business (for the better hopefully) requires changes and updates to your machine learning models on a regular basis.

IoT + ETL + ML Models + Cloud = Optimizing Business
There are many considerations when beginning your Industrial IoT journey. There are no short-cuts, and the effort requires investing in and developing new skills and leveraging new technologies, but the journey will profoundly change your business for the better.

Friday, April 1, 2016

Law of Parsimony Strikes Back

Let me first start off by saying (hate it when people start off saying this - usually means some principled BS is coming) that many new Big Data technologies such as the concept of Map/Reduce, Machine Learning and products such as Hadoop, Spark and NoSQL databases are great tools to have in your IT arsenal. Also don't forget other infrastructure technologies such as hardware virtualization, software containers and other micro-services deployment architectures that are making IT environments more flexible and more manageable (note, this does not mean simpler). There is no doubt these technologies fit a number of problem domains that in the past were very hard to handle with standard computing stacks, IT tooling, and relational database technology.

Now having said that, let's be careful not to over apply them and end up with a system that is fragile and takes an army (or perhaps small army) of super smart operations people to deploy and run. I always go back to one of my favorite principles, "the law of parsimony", sometimes referred to as Occam's razor. It boils down to the reality that nature has a habit of looking for the simplest path to solve a problem, although we can't always see the simple elegance. Take for example nature's design of the common leafy tree. While it has a complex structure, this structure comes from some very simple principles.

I feel we are at a point with technology where we are advancing at a great pace, but in the process of doing this we are creating a lot of complexity. As Occam's razor states, complexity is a relative consideration among the alternatives, so the bar is always moving up with what we consider to be complex, but we sometimes need to step back and not use a bulldozer when a shovel will do the job just as well and will not break down on us when we need it most.

I feel that way with a lot of technology I see being applied. There are so many options to choose from that sometimes the simplest option for the problem is overlooked. This might be human nature, "he who dies with the most toys wins", but this can be a costly mistake for many businesses.

I often joke that if you give me a bunch of plain old app servers (pick your favorite) and a relational database (pick your favorite), I can move the world (might need a load balancer in there somewhere ;). So, when you are in your next architecture design meeting ask yourself this question, can I bend my tools to my will or do I need new toys to play with :)

Saturday, March 26, 2016

The Era of Deep Learning is Here

Have to agree with Google on this. Innovation in Machine Learning & Deep Learning combined with serverless cloud platforms will turn more and more data into actionable information by making these data science services and functions available to a wider audience.

This comment from the article is telling where we are and where we have to go:
We also see data scientists complaining that they spend up to 80% of their time preparing the data and training the models before they can even begin to extract any value out of the current machine learning technologies. In fact, some data scientists sarcastically call themselves “data janitors,” because they spend more time preparing data than they do analyzing it.

It is currently fairly complex from an IT perspective to construct the infrastructure and services needed for training and building learned models and to effectively ingest data at the scale and velocity needed to turn raw data into value. This is also complicated by the challenge of finding the necessary skill sets to make this happen. The market and landscape however are evolving fast on all fronts, both on the IT front as described in the Google article and with more specialized IT skill sets becoming available. Stay tuned :)

Monday, February 29, 2016

Essbase Analytics with Tableau, Cassandra and Spark

Using Hyperion Essbase? Looking to get some of that financial, accounting, sales, and marketing data that is locked in your Essbase cube, out and into something more accessible? Essbase is a very powerful data modeling platform, but it was built quite a while back (in tech time) when multi-dimensional modeling and query languages like MDX were a new frontier for data modeling and analytics.

Essbase was also built when requirements for analytics, reporting and visualization were much more constrained and the expectations for realtime reporting were not as demanding as they are now (not to mention data volumes). There are many organizations using Essbase for critical business functions, so streamlining the path to quicker decision making and more robust what-if type analysis is critical to being competitive and for optimizing the operational performance of your business.

Oracle Essbase has a number of built-in applications for reporting and business intelligence that give business analysts and developers the ability to visualize and drill down into the data within the cube. But with the evolution of Big Data and modern analytical and visualization tools, wouldn't you like to get that data you have locked up in Essbase out of that legacy cube and into something more accessible and flexible, such as Tableau, for rich and rapid visualization? And wouldn't you like to have your terabytes of cube data in Cassandra and available to Apache Spark for big data style ETL, machine learning, and mashing and correlating with other data sources?

Well, while there is no easy out of the box solution to accomplish all of this, the dream to turn your Essbase cube into another data lake that is part of your Big Data ocean and more available for rich analytics and predictive modeling and visualization is achievable with a little work and elbow grease.

Let's start by describing how you can do this. The first step is getting your data out of Essbase and this is probably the most difficult step. There are a number of ways to access data from Essbase. It first starts with understanding what "information" you want to extract. You typically don't want to directly extract the raw data that is in the Essbase cube (but you could do that as well). Such data is often too granular (one of the reasons it is in a cube), so you might need to perform some aggregations (across the dimensions) and apply some business logic as you extract the data from the cube. This is an ETL step that more or less denormalizes that data out of the cube and flattens it out into a format that will be ideal for Tableau (further downstream in the process) and applies necessary business logic to the data to get it into a more consumable form. Tableau is ideal at consuming such "flattened" information given how it extracts dimensionality out of denormalized input sources.
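
To give a feel for what "aggregate and flatten" means here, below is a toy plain-Python sketch. It does not use any Essbase API; the cube is faked as a dictionary keyed by dimension members, and the account names, entities and the margin calculation are all made up. It rolls up one dimension, pivots another into columns and applies a bit of business logic, producing the kind of flat rows Tableau likes to consume.

```python
from collections import defaultdict

# Hypothetical cube cells keyed by dimension members: (account, entity, month) -> amount.
cells = {
    ("Revenue", "Store-1", "2016-01"): 120.0,
    ("Revenue", "Store-2", "2016-01"): 80.0,
    ("Revenue", "Store-1", "2016-02"): 150.0,
    ("Expenses", "Store-1", "2016-01"): 70.0,
    ("Expenses", "Store-2", "2016-01"): 50.0,
}


def flatten(cells):
    """Roll up the entity dimension, then pivot accounts into columns so each
    output row is one flat, denormalized record per month."""
    totals = defaultdict(lambda: defaultdict(float))
    for (account, _entity, month), amount in cells.items():
        totals[month][account] += amount

    rows = []
    for month in sorted(totals):
        revenue = totals[month]["Revenue"]
        expenses = totals[month]["Expenses"]   # defaults to 0.0 when no cell exists
        rows.append({
            "month": month,
            "revenue": revenue,
            "expenses": expenses,
            "margin": revenue - expenses,  # business logic applied during extraction
        })
    return rows


for row in flatten(cells):
    print(row)
```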

Often what is stored in Essbase dimensions and metrics are the detailed data elements (financial, sales...etc) that might need some business transformation applied to them before extraction out of the cube. So this ETL process will prepare the data for ultimate consumption by Tableau. This is part of the art of the design modeling that goes into the overall data transformation pipeline, and it requires that you understand what category of information you are after from the raw source Essbase data that is locked in the cube. This modeling exercise is a very critical step to get correct in order for the data to be in a structure that can be visualized by Tableau.

Now for the actual mechanics of extracting data from Essbase. Essbase provides a few ways to get data out of the cube.

The diagram above shows two options for extracting data from Essbase. Smart View is one option that leverages a spreadsheet approach for extracting, transforming and flattening data out of the cube in preparation for being channeled further downstream. While Smart View is not a pure programmatic API, the Excel spreadsheet capabilities allow for a lot of ad-hoc exploring for getting data out of the cube, and what can be done with Smart View and the supported Essbase APIs available through Excel should not be underestimated.

The second option shown in the diagram is using the Essbase Java API. Using the Java API allows for directly querying the Essbase database and gives very dynamic and flexible access to the cube. This can be the most robust way to get at data in the cube but is the most development intensive and a bit harder to make flexible and configurable (unlike Excel).

One thing to note is that Smart View and the Java API are not mutually exclusive. Behind the scenes Smart View is using the Java API and functions as a middleman service that allows Excel to interface with Essbase. There is a Smart View server which exposes web services accessed by Smart View. The Smart View server (aka Analytics Provider Services or APS for short) then uses the Essbase Java API to talk natively to the Essbase server directly.

The main goal of this step (whether using Smart View or Java API), is to extract the cube data that we ultimately want to feed into Tableau.

The next step is storing the extracted data described in the first step. The goal here is to store the flattened data in Cassandra tables. This requires a custom loader app to take the flattened data and load it into Cassandra. What is critical to consider in the design up front is whether the load process will be purge and reload, time series DW loading (fast changing dimensional data) or change data DW loading (slow changing dimensional data). See diagram below.
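
The three load styles behave quite differently once a second extract arrives. Here is a rough plain-Python sketch of the difference, using an in-memory dictionary as a stand-in for a Cassandra table. The function names, the row layout and the sample values are all made up for illustration; a real loader would go through a Cassandra driver.

```python
def purge_and_reload(table, batch):
    """Replace the whole table with the latest extract."""
    table.clear()
    table.update({row["key"]: [row] for row in batch})


def time_series_load(table, batch):
    """Append every extracted row; history grows with each load."""
    for row in batch:
        table.setdefault(row["key"], []).append(row)


def change_data_load(table, batch):
    """Append a row only when its value differs from the latest stored version."""
    for row in batch:
        versions = table.setdefault(row["key"], [])
        if not versions or versions[-1]["value"] != row["value"]:
            versions.append(row)


# Two hypothetical extracts: revenue changes between loads, cost does not.
monday = [{"key": "rev", "value": 100, "loaded": "mon"}, {"key": "cost", "value": 40, "loaded": "mon"}]
tuesday = [{"key": "rev", "value": 110, "loaded": "tue"}, {"key": "cost", "value": 40, "loaded": "tue"}]

results = {}
for name, load in [("purge", purge_and_reload), ("series", time_series_load), ("change", change_data_load)]:
    table = {}
    load(table, monday)
    load(table, tuesday)
    # Record how many versions of each key the table holds after both loads.
    results[name] = {key: len(versions) for key, versions in table.items()}

print(results)
```

The choice drives both storage growth and what questions you can ask later: purge and reload keeps no history, time series keeps everything, and change data keeps only the moments something actually changed.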

Storing the data in Cassandra will set us up for the final stage of the process, which is creating the Tableau Data Extract that will deliver the final data processing stage. Note that in setting up the data for loading into Cassandra, Spark can be used to aid in the ETL process. One often overlooked feature of Apache Spark is that it is an excellent ETL tool. In fact, Spark deployment efforts often end up performing quite a bit of ETL logic in order to prepare data for the final stage of modeling and machine learning processing. Apache Spark is a great tool for the emerging field of realtime ETL that is powering the next evolution of data movement in Big Data environments.

The next step in the process is using the Cassandra structured data in an environment where the Cassandra tables can be made visible to Tableau for realtime extraction and modeling. This is where Apache Spark comes into the picture. Normally if you set up Cassandra as a direct data source for Tableau, you will have processing limitations: Cassandra can't perform the joins and aggregations needed by Tableau, so the Tableau analytics will be forced to occur on the Tableau client side. However, with Spark in the equation the Tableau analytics and related processing can happen within the Spark cluster.

Here is a final picture of the major components in the workflow and processing pipeline:

While there are some pitfalls to be wary of, this is often the case in any Big Data build out. And using products like Essbase and Tableau doesn't make the build out any easier. It would be nice to have fewer moving parts, but with a sound deployment and infrastructure this architecture can be made to scale out and is practical to apply in smaller footprint deployments as well.

Here are a couple of useful links that describe in more detail how the Spark, Cassandra and Tableau integration works:

With this architecture you get the scalability of Spark and Cassandra for both data processing and storage scale out. In addition, with this approach you avoid a common requirement with Tableau to create TDEs (Tableau Data Extracts) that are cached/stored on Tableau Server, which arises because source systems such as Essbase and even traditional RDBMS environments often don't scale to support Tableau Server/Desktop needs for realtime aggregations and transformations. Apache Spark steps in to provide the Big Data computational backbone needed to drive the Tableau realtime visualizations and modeling. While Tableau Server is great at serving the Tableau web UI and helping with some of the data governance (note this is an area Tableau is improving in), Tableau's server-side storage and processing capabilities are somewhat limited (as of this writeup).

To sum things up, Essbase cubes and related reporting services are not very scalable and accessible beasts, so this is where the combination of Cassandra and Spark can help out and give Tableau a better compute backbone that can drive interactive data visualization of your Essbase cube. Hopefully this information will inspire you to look at using Tableau with Essbase and help you ultimately unlock the potential of your Essbase financial data!

Tableau and Essbase can be a great combination for building rich reporting and dashboards without the overhead and complexity of traditional data warehousing and BI tools. Get your financial data out of Essbase and into Tableau and into the hands of your executives and decision makers. Contact Grand Logic to learn more.

Tuesday, February 2, 2016

Spark Processing for Low Latency Interactive Applications

Apache Spark is typically thought of as a replacement for Hadoop MapReduce for batch job processing. While it is true that Spark is often used for efficient large scale distributed cluster type processing for compute intensive jobs, it can also be used for processing low latency operations used in more interactive applications.

Note this is different than Spark Streaming and micro-batching. What we are talking about here is using Spark's traditional batch memory centric MapReduce functionality and powerful Scala (or Java/Python/R APIs) for low-latency and short duration interactive type processing via REST APIs integrated directly into application code.

The Spark processing API is very powerful and expressive for doing rich processing and the Spark compute engine is efficient at optimizing data processing and access to memory and workers/executors. Leveraging this in your interactive CRUD applications can be a boon for application developers. Spark makes this possible with a number of capabilities available to developers once you have tuned your Spark cluster for this type of computing scenario.

First, latency can be reduced by caching Spark contexts and even caching (when appropriate) RDDs. The open source Job Server project is a Spark related project that allows you to manage a pool of Spark contexts, which essentially creates cached connections to a running Spark cluster. By leveraging Job Server's cached Spark contexts and REST API, application developers can access Spark with lower latency and enable access to multi-user shared resources and processing on the Spark cluster. Another interesting project that can be useful for interactive applications is Apache Toree - check it out as well.

Secondly, you can set up a Standalone Spark cluster adjacent to your traditional application server cluster (a Tomcat servlet engine cluster, for example) that is optimized for handling concurrent application requests. Spark has a number of configuration options that allow a Spark cluster to be tuned for concurrent short duration job processing. This can be done by sharing Spark contexts as described, by using the Spark fair scheduler, and by tuning RDD partition sizing for the given set of worker executors to keep partition shuffling to a minimum. You can learn more from this video presentation on optimizing Job Server for low-latency and shared concurrent processing.

Tuning a multi-user friendly Spark cluster frees application developers to leverage Spark's powerful Scala, Java, Python and R APIs in ways not available to traditional application developers in the past. With this capability you can enhance traditional CRUD application development with low-latency MapReduce type functionality to create applications developers could not have imagined before.

With this type of architecture, where your traditional application servers are using an interactive low-latency Spark cluster via a REST API, you can integrate a variety of data sources and data/analytics services together using Spark. You can, for example, mash up data from your relational database and Cassandra or MongoDB to create processing and data mashups you could not do easily with hand written application code. This approach opens up a bountiful world of powerful Spark APIs to application developers. Keep in mind of course that if your Spark operations require execution on a large set of workers/nodes and RDD partitions, this will likely not lead to very good response times. Any operation with a reasonable number of stages that can be configured to process on one or a few RDD partitions has the potential to fit this scenario, but again this is something for you as the developer to quantify.

Running a Spark cluster tuned for servicing interactive CRUD applications is achievable and is one of the next frontiers that Spark is opening up for application developers. This will open the door to data integrations and no-ETL computing that were not feasible or imaginable in the past. Meshing data from multiple data stores and leveraging Spark's powerful processing APIs is now accessible to application developers and is no longer the realm of backend batch processing developers. Get started today. Stand up a Spark cluster, tune it for low-latency processing, set up Job Server and then create some amazing interactive services!

Monday, February 1, 2016

Temporal Database Design with NoSQL

Managing data as a function of time in a database is a common requirement for many applications and data warehousing systems. Knowing when a data element or group of elements has changed, and over what period of time the data is valid, is often a required feature in applications and analytical systems.

While not easy compared to traditional CRUD database development, supporting this type of bitemporal management functionality in a traditional RDBMS such as MySQL or Oracle is fairly well understood by data modelers and database designers. Such temporal data modeling can be done in a variety of ways in a relational database for both OLTP and OLAP style applications. For example, Oracle and IBM DB2 have built-in extensions for managing bitemporal dimensionality at the table and schema level. It is also possible to roll your own solution with any of the major RDBMS engines by adding time dimension columns (very carefully) to your schema and then managing the updating and insertion of new change records with the appropriate DML and transactions. To do this precisely and 100% consistently, the database must support durable ACID transactions, something every major RDBMS has in spades. See the Wikipedia links for background on temporal database models.

Now this is all great: temporal and bitemporal table/schema design is a well-understood concept among data architects in the RDBMS world. But how do you do this if you are on the Big Data and NoSQL bandwagon? To begin with, most NoSQL databases lack support for ACID transactions, which are a prerequisite for handling temporal operations on slowly changing dimensions (temporal data) and bitemporal dimensions (a valid time dimension and a transaction time dimension). ACID transactions are required in order to properly mark expired records as new records are being appended. Records must never overlap and must be expired precisely as new valid time and transaction time record slices are added to the database.
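To make the "expire and append" requirement concrete, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for an ACID database. It tracks only the valid time dimension; the table name, column names and the `9999-12-31` open-ended sentinel are assumptions chosen for illustration, not a prescribed schema.

```python
import sqlite3

OPEN_END = "9999-12-31"  # assumed sentinel meaning "still valid"


def open_store():
    """Create an in-memory table holding one row per valid-time slice."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE price_history ("
        " item_id TEXT NOT NULL, price REAL NOT NULL,"
        " valid_from TEXT NOT NULL, valid_to TEXT NOT NULL)"
    )
    return conn


def record_change(conn, item_id, price, effective_date):
    """Expire the current slice and append the new one in a single transaction."""
    with conn:  # commits on success, rolls back if either statement fails
        conn.execute(
            "UPDATE price_history SET valid_to = ?"
            " WHERE item_id = ? AND valid_to = ?",
            (effective_date, item_id, OPEN_END),
        )
        conn.execute(
            "INSERT INTO price_history VALUES (?, ?, ?, ?)",
            (item_id, price, effective_date, OPEN_END),
        )


def price_as_of(conn, item_id, date):
    """Return the price whose half-open interval [valid_from, valid_to) covers date."""
    row = conn.execute(
        "SELECT price FROM price_history"
        " WHERE item_id = ? AND valid_from <= ? AND ? < valid_to",
        (item_id, date, date),
    ).fetchone()
    return row[0] if row else None


conn = open_store()
record_change(conn, "widget", 10.0, "2016-01-01")
record_change(conn, "widget", 12.5, "2016-02-01")
print(price_as_of(conn, "widget", "2016-01-15"))
print(price_as_of(conn, "widget", "2016-02-01"))
```

Because the update and the insert commit together, a reader never sees two open slices for the same item or a gap between slices. Without a transaction, a failure between the two statements would leave exactly that kind of inconsistency.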

NoSQL databases such as Cassandra and Couchbase are powerful database engines that can be leveraged for a wide segment of data processing and storage needs. NoSQL databases offer many benefits, including built-in distributed storage/processing, flexible schema modeling and efficient sparse data management. These benefits come at a price, however, which limits the applicability of NoSQL databases where durable ACID transactions are required, such as managing multi-row, multi-table transactions for both OLTP and OLAP data processing.

To address this limitation, a NoSQL database such as Couchbase or Cassandra can be paired with an ACID database, both operationally and at the schema design level, so that the NoSQL database is used for what it is best at while bitemporal operations are supported by the RDBMS. Under the hood this is done seamlessly by a data serialization and deserialization API that synchronizes and coordinates DML operations between the RDBMS and the NoSQL database. The schema design provides a polyglot database framework that supports temporal and bitemporal data modeling, along with a data access and query API that supports durable bitemporal operations while retaining the flexibility and advantages of NoSQL data modeling (document, key/value, etc.).
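One way such a pairing might look is sketched below. This is an illustration under stated assumptions, not the actual implementation described above: a plain dict stands in for the NoSQL document store, sqlite3 stands in for the paired RDBMS, and the class, table and column names are all hypothetical. Only valid time is tracked; transaction time would need a second pair of columns.

```python
import json
import sqlite3
import uuid

OPEN_END = "9999-12-31"  # assumed sentinel meaning "still valid"


class TemporalDocumentStore:
    """Documents live in a key/value store; their time slices live in an ACID index."""

    def __init__(self):
        self.documents = {}  # stand-in for a NoSQL document store
        self.index = sqlite3.connect(":memory:")  # stand-in for the paired RDBMS
        self.index.execute(
            "CREATE TABLE slices ("
            " entity TEXT, doc_key TEXT, valid_from TEXT, valid_to TEXT)"
        )

    def put(self, entity, document, valid_from):
        """Store an immutable document version, then publish it transactionally."""
        doc_key = str(uuid.uuid4())
        # Write the document first: until the index row exists it is invisible.
        self.documents[doc_key] = json.dumps(document)
        with self.index:  # expire the old slice and add the new one atomically
            self.index.execute(
                "UPDATE slices SET valid_to = ?"
                " WHERE entity = ? AND valid_to = ?",
                (valid_from, entity, OPEN_END),
            )
            self.index.execute(
                "INSERT INTO slices VALUES (?, ?, ?, ?)",
                (entity, doc_key, valid_from, OPEN_END),
            )
        return doc_key

    def get_as_of(self, entity, date):
        """Look up the slice covering date, then deserialize its document."""
        row = self.index.execute(
            "SELECT doc_key FROM slices"
            " WHERE entity = ? AND valid_from <= ? AND ? < valid_to",
            (entity, date, date),
        ).fetchone()
        return json.loads(self.documents[row[0]]) if row else None


store = TemporalDocumentStore()
store.put("customer:42", {"tier": "silver"}, "2016-01-01")
store.put("customer:42", {"tier": "gold"}, "2016-03-01")
print(store.get_as_of("customer:42", "2016-02-10"))
```

The design choice in this sketch is that document versions are never updated in place. Each version gets a fresh key, so the only operation that needs transactional guarantees is the small index update in the relational store.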

This approach can be applied to NoSQL databases in OLTP, data warehousing and Big Data environments. So leverage your favorite NoSQL database with the best of both worlds! Get your polyglot engines going: your favorite NoSQL database just got bitemporal! Contact Grand Logic to learn how we can help you build your next bitemporal database environment.

Monday, January 11, 2016

Big Data Warehouse with Cassandra & Spark

Enterprise data warehousing (EDW) has traditionally been the realm of big iron databases from vendors such as Oracle and IBM and of other vertical storage engines such as Teradata. With the rapid evolution of Big Data in the past few years, the market has begun to shift away from monolithic and highly structured data storage engines that lack inherent support for the tenets of Big Data.

While data warehousing (DW) design has traditionally implied denormalization and a focus on data structures that are more in tune with the applications using them (sounds a bit like the NoSQL philosophy, doesn't it?), many of the Big Data storage options and NoSQL databases lack some of the functionality (at least out of the box) needed for the ad-hoc querying and analytics that a data warehousing solution requires.

Enter Cassandra and Spark. Together, these two products allow you to build your own robust and flexible data warehousing and analytics solution while running on top of a Big Data centric compute and storage grid environment. Cassandra and Spark complement each other to provide flexible data storage along with rich query and analytics processing.

Cassandra is widely known in the industry for its modular scaling, built-in partitioning and replication. Cassandra's query interface (CQL) has some of the benefits of SQL while allowing for the benefits of NoSQL: semi-structured data, wide column scaling and sparse row capabilities. But with many of Cassandra's powerful NoSQL features come inherent limitations, such as the inability to perform aggregation operations and rich analytics functions within Cassandra. And as with most NoSQL (non-relational) storage engines, joining tables is not something Cassandra offers. These are significant gaps when building a data warehouse.

This is where Spark and its integration with Cassandra fill the feature gap, giving Cassandra the capabilities necessary for a fully capable data warehousing platform. Spark's data management capabilities via RDDs (Resilient Distributed Datasets) and Spark's powerful distributed compute fabric combine to provide a robust and highly scalable storage and analytics data warehousing solution.
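To illustrate the kind of query this gap refers to, here is a small single-process sketch of a join followed by an aggregation. It is plain Python, not the Spark API: Spark would distribute the same two steps across a cluster. The rows, field names and the `revenue_by_region` helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical rows, shaped like what two Cassandra tables might return.
orders = [
    {"order_id": 1, "customer_id": "c1", "amount": 40.0},
    {"order_id": 2, "customer_id": "c2", "amount": 15.0},
    {"order_id": 3, "customer_id": "c1", "amount": 25.0},
]
customers = [
    {"customer_id": "c1", "region": "west"},
    {"customer_id": "c2", "region": "east"},
]


def revenue_by_region(orders, customers):
    """Join orders to customers on customer_id, then sum amount per region."""
    # Build a lookup for the smaller side of the join.
    region_of = {c["customer_id"]: c["region"] for c in customers}
    totals = defaultdict(float)
    for order in orders:
        region = region_of.get(order["customer_id"])
        if region is not None:  # inner join: skip orders with no matching customer
            totals[region] += order["amount"]
    return dict(totals)


print(revenue_by_region(orders, customers))
```

Neither step can be expressed as a single query against the two tables in Cassandra, which is why a compute layer such as Spark is needed on top of it for warehouse-style reporting.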

One of the big benefits of building your DW solution on Cassandra and Spark is that you get the advantages of Big Data scaling (compute and storage) while running on commodity hardware and leveraging Spark's elegant programming interfaces (Scala, Java, Python, R). And with Spark you have room to build machine learning and other deep analytics on your data without the lock-in and limitations of legacy big iron data warehousing engines.

Roll up your sleeves and start your own journey to build your next Big Data Warehouse using Spark and Cassandra.