Friday, September 23, 2016

Analytics as a Conversation

The pendulum is swinging in the business intelligence and analytics world. The ongoing technology evolution, driven in part by the adoption of Big Data, machine learning and other advancements in cloud computing, has made storing, modeling and analyzing huge volumes and velocities of data possible. The tools and IT skills needed to turn this data into rich visual information are now more accessible than ever before.

Products like Tableau, Splunk, Qlik and Birst, among others, have brought rich visualization and actionable-minded analytics (actionable analytics are still not that common :) to the masses. It is now easier than ever to build rich visualizations, reports and dashboards. Building BI solutions that tackle all the data percolating around us, across social networks, the IoT and within the enterprise, and turning it into compelling visual user experiences is now within reach of the IT masses.

But there is trouble brewing on the horizon. Is there such a thing as too much data? Too much information? Too much visualization? I have built my share of BI solutions and I have seen many amazing and compelling visualizations and dashboards built with powerful products like Tableau and many home-grown SaaS BI platforms. But I think it is time to step out of the forest and look at how humans effectively interact with information.

While we rely heavily on our visual sense, even the most well-intentioned and minimalistic BI dashboard (and its supporting drill-down reports) might not always be the best way to get to the information you want or need. Humans have another ability for consuming information: the conversation (question and answer).

There are many technologies now converging that make it possible for us to evolve our BI stack beyond purely visualization-based analytics. Analytics-as-a-Conversation (A3C) is, in my mind, the next frontier for BI. It does not necessarily replace today's rich visualization-based BI, but augments it.

What is A3C? Well, in movie terms, it is sort of like the Matrix. It is about having a conversation with your BI and getting at what you need (the what) through normal human-like conversation (think texting, hashtags, tweets and even emojis). This conversational form of BI is a much more natural way of interacting with complex information and can more naturally lead to asking your BI Matrix not just the "what" questions but the "why" questions. This form of information interrogation also lends itself to setting a clearer context for the information exchange, as the BI conversation progresses from one question and answer to the next. For example, perhaps you ask your A3C system the value of a particular KPI, or which KPI is the most off its norm this quarter. That can then naturally lead to questions such as "why is this KPI higher this quarter?"

Obviously we are not Neo and we are not talking to the Matrix, so the system has to be taught (or programmed to learn) how to converse with a human-like grammar, and it has to be programmed to extract what it needs from the grammar/questions using NLP and then translate that into queries against the target data and metadata system. There would have to be bounds on the grammar and enough knowledge of the system's metadata to compose the proper answers. No small engineering effort, to say the least, but from where we are today with AI, bots, machine learning, NLP and general computing stacks, the technology is there to accomplish this.
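
To make that a bit more concrete, here is a toy sketch of the kind of translation layer this implies. Everything in it is hypothetical: a couple of hand-rolled patterns stand in for a real NLP/intent model, and the kpi_snapshot/kpi_drivers tables and current_quarter() function are made up for illustration.

import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ConversationalQueryStub {

    // A real system would use an NLP/intent model; simple patterns stand in for it here.
    private static final Map<Pattern, String> INTENTS = new LinkedHashMap<>();
    static {
        INTENTS.put(
            Pattern.compile("what is (.+) this quarter\\??", Pattern.CASE_INSENSITIVE),
            "SELECT value FROM kpi_snapshot WHERE kpi_name = '%s' AND period = current_quarter()");
        INTENTS.put(
            Pattern.compile("why is (.+) (higher|lower) this quarter\\??", Pattern.CASE_INSENSITIVE),
            "SELECT driver, contribution FROM kpi_drivers WHERE kpi_name = '%s' "
                + "AND period = current_quarter() ORDER BY contribution DESC");
    }

    // Map a conversational question to a query against the data/metadata system.
    public static String toQuery(String question) {
        for (Map.Entry<Pattern, String> intent : INTENTS.entrySet()) {
            Matcher m = intent.getKey().matcher(question.trim());
            if (m.matches()) {
                // The extracted entity (the KPI name) parameterizes the query.
                return String.format(intent.getValue(), m.group(1));
            }
        }
        return null; // out of grammar: ask the user to rephrase
    }

    public static void main(String[] args) {
        System.out.println(toQuery("What is gross margin this quarter?"));
        System.out.println(toQuery("Why is gross margin higher this quarter?"));
    }
}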

Why now? Because the technologies needed to construct the BI Matrix I am describing are largely here, and the data volumes are now, in my mind, overwhelming even the best BI visualizations. With a bit of creativity (and sweat), and with the current availability and advancements in machine learning, AI and general computing power, it is possible today to begin to build such intelligent conversational analytics systems and user experiences. Don't forget, this is about changing how the user "experiences" data.

It is not just about data volumes and technology capabilities; human interaction has itself evolved in the past decade. With the recent explosion of mobile and social communication, humans are using texting and short messages to communicate more than ever, with no sign of ebbing. In fact, texting is quickly becoming the dominant form of communication and the main form of information exchange across the globe and across all demographics.

How is this better than the visualization-based BI we have today? Well, I would say it is not necessarily a replacement for the BI we have today, but is instead complementary and can lead to BI answering questions of "what" and "why" that the original BI developer/modeler could not necessarily anticipate out of the box. And as artificial intelligence and machine learning systems continue to evolve and improve, the potential is virtually limitless and no longer bounded by what can be rendered on a 2D display or reached by a click of the mouse.

The revenge of the CLI (the command line interface) is upon us :) But don't underestimate the conversational CLI; it will prove to be orders of magnitude more powerful than any visualization a human can conjure up.

Stay tuned....Analytics-as-a-Conversation is coming and we will all be talking about it (or talking with it).

Tuesday, August 30, 2016

Getting Bitemporal With Your OLTP Database


The concept of a bitemporal database can seem a bit too exotic and complex to be considered for a typical RDBMS schema model. While the transaction processing and query structures required to make this happen with a standard RDBMS are more involved than in a normal database model, annotating every table in your RDBMS model with bitemporal semantics is a fairly straightforward design methodology.

Consider what the table structure might look like: the TT start/end columns are the transaction time dimension and the VT start/end columns are the validity time dimension. These four columns drive the basic schema model structure for a bitemporal database and enable powerful queries that can pivot and scan data across two time dimensions without the need for a data warehouse or other complex analytics.
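
A minimal sketch of such a table, created over JDBC (the customer_address table, its columns and the connection details are illustrative; an open-ended epoch is represented here by a NULL end column):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class CreateBitemporalTable {
    public static void main(String[] args) throws Exception {
        String ddl =
            "CREATE TABLE customer_address (" +
            "  customer_id BIGINT       NOT NULL," +
            "  address     VARCHAR(200) NOT NULL," +
            "  tt_start    TIMESTAMP    NOT NULL," +  // transaction time: when the row was recorded
            "  tt_end      TIMESTAMP," +              // when the row was superseded; NULL = current knowledge
            "  vt_start    TIMESTAMP    NOT NULL," +  // valid time: when the fact became true
            "  vt_end      TIMESTAMP," +              // when the fact stopped being true; NULL = still true
            "  PRIMARY KEY (customer_id, tt_start, vt_start)" +
            ")";

        // The JDBC URL and credentials are placeholders; nothing vendor-specific is used.
        try (Connection con = DriverManager.getConnection(
                 "jdbc:postgresql://localhost/appdb", "app", "secret");
             Statement stmt = con.createStatement()) {
            stmt.execute(ddl);
        }
    }
}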


The table above looks straightforward, right? And it is. The bit of complexity comes in handling the actual data mutations (a change to a row): every row that is superseded by a new TT/VT tuple must be closed out with proper time semantics, and this must happen in a transactionally consistent fashion to ensure a continuous flow of tuple epochs (an epoch is a row at a particular TT/VT point in time), where each new epoch properly terminates the prior epochs' TT and VT ranges at the start of its own.
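
A deliberately simplified sketch of that supersede logic over JDBC, continuing the illustrative customer_address table above (a real implementation has more cases to cover, such as retroactive corrections that split existing valid-time ranges):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.Timestamp;
import java.time.Instant;
import javax.sql.DataSource;

public class BitemporalUpdate {

    // Record a new address for a customer, superseding the currently open epoch.
    public static void changeAddress(DataSource ds, long customerId,
                                     String newAddress, Instant validFrom) throws Exception {
        Timestamp now = Timestamp.from(Instant.now());
        Timestamp vtStart = Timestamp.from(validFrom);

        try (Connection con = ds.getConnection()) {
            con.setAutoCommit(false); // both statements commit or roll back together

            // 1. Terminate the currently open epoch at the new epoch's start times.
            try (PreparedStatement close = con.prepareStatement(
                    "UPDATE customer_address SET tt_end = ?, vt_end = ? " +
                    "WHERE customer_id = ? AND tt_end IS NULL")) {
                close.setTimestamp(1, now);
                close.setTimestamp(2, vtStart);
                close.setLong(3, customerId);
                close.executeUpdate();
            }

            // 2. Insert the new open epoch; NULL end columns mean "current knowledge / still valid".
            try (PreparedStatement insert = con.prepareStatement(
                    "INSERT INTO customer_address " +
                    "(customer_id, address, tt_start, tt_end, vt_start, vt_end) " +
                    "VALUES (?, ?, ?, NULL, ?, NULL)")) {
                insert.setLong(1, customerId);
                insert.setString(2, newAddress);
                insert.setTimestamp(3, now);
                insert.setTimestamp(4, vtStart);
                insert.executeUpdate();
            }

            con.commit();
        }
    }
}

A point-in-time query then simply pins both dimensions, along the lines of: SELECT address FROM customer_address WHERE customer_id = ? AND tt_start <= ? AND (tt_end IS NULL OR tt_end > ?) AND vt_start <= ? AND (vt_end IS NULL OR vt_end > ?).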

The advantages of a bitemporal schema model are many. They include:
  1. Immutable data structures, which means all tuples preserve all changes across time.
  2. Built-in audit trail functionality, since no change is ever overwritten.
  3. The ability to write fairly simple queries to view data at any point in time.
  4. The ability to easily compare any two points in time for changes.
  5. The ability to find all changes across a time range.

Injecting bitemporal capabilities into your schema allows tracking every change that happens within a table across two time dimensions: transaction time (when the mutation happened) and validity time (the time range over which the mutation and resulting state of the row are valid).

Some databases such as Oracle, DB2 and PostgreSQL have specialized extensions to support bitemporal capabilities, but you don't really need these extensions - they only help with the DDL aspect of the design and not with the DML or query aspect. For the most part, these extensions are just syntactic sugar that you can implement on your own in a more cross-database fashion and even extend to support NoSQL databases as well.

Get started with turning your schema model into a bitemporal powered RDBMS. Contact Grand Logic to learn how we can help you build your next bitemporal database environment.


Thursday, May 26, 2016

Building ML Pipelines


What is involved in building a machine learning pipeline? Here is a common flow:

  • Data pre-processing
  • Feature extraction
  • Model fitting
  • Validation stages
Learn more about ML pipelines (from a Spark perspective).
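
As a rough sketch of that flow using Spark's spark.ml Pipeline API: a small text-classification pipeline covering each of the stages above. The SparkSession, the documents DataFrame and its "text"/"label" columns are assumptions for illustration.

import org.apache.spark.ml.Pipeline;
import org.apache.spark.ml.PipelineModel;
import org.apache.spark.ml.PipelineStage;
import org.apache.spark.ml.classification.LogisticRegression;
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator;
import org.apache.spark.ml.feature.HashingTF;
import org.apache.spark.ml.feature.Tokenizer;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

public class TextPipeline {

    // documents: a DataFrame with a "text" column and a numeric "label" column.
    public static void trainAndValidate(Dataset<Row> documents) {
        // Data pre-processing: split raw text into tokens.
        Tokenizer tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words");

        // Feature extraction: hash tokens into a fixed-size feature vector.
        HashingTF hashingTF = new HashingTF()
            .setInputCol("words").setOutputCol("features").setNumFeatures(10_000);

        // Model fitting: a simple logistic regression classifier.
        LogisticRegression lr = new LogisticRegression().setMaxIter(10);

        Pipeline pipeline = new Pipeline()
            .setStages(new PipelineStage[]{tokenizer, hashingTF, lr});

        // Validation: hold out a test split and score the fitted pipeline on it.
        Dataset<Row>[] splits = documents.randomSplit(new double[]{0.8, 0.2}, 42L);
        PipelineModel model = pipeline.fit(splits[0]);
        double auc = new BinaryClassificationEvaluator().evaluate(model.transform(splits[1]));
        System.out.println("AUC on held-out data: " + auc);
    }
}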


Wednesday, May 25, 2016

Tuesday, May 24, 2016

Spark Structured Streaming - Crazy Like a Fox

Big Data related computing has matured greatly over the past several years from its early and humble Map-Reduce days. Hadoop introduced developers and enterprises to mainstream distributed computing on commodity hardware and with a software stack (largely Java based) that was accessible to the average developer.

Early versions of Hadoop did not have the most developer friendly APIs, but they made breaking up large computing tasks and iterative processing possible to scale without big iron and SMP hardware. Things evolved and improved with the emergence of memory efficient Big Data engines such as Apache Spark. This has also been helped by the fact that memory prices keep dropping.

A lot of attention has been given to Apache Spark these days as the successor to Hadoop. The big advantage that Spark is touted to have over Hadoop is how its Map-Reduce engine leverages distributed memory to improve performance over classic Hadoop. While this is true, the broader Hadoop ecosystem has been evolving rapidly as well, so this alone is not at the heart of what has given Apache Spark such a big leap forward.

What is often underestimated in the growing popularity of Spark is its API. If you have ever tried to write a Map-Reduce type job in Java against Hadoop 1.x or 2.x you will understand. Spark is API-plural, with support for Scala, Java, Python and R. The way you build data processing pipelines and construct transformations and aggregations in Spark was well thought out by the authors of Spark.


Spark is not standing still either. With the development of Spark Streaming, Spark SQL, DataFrames and Datasets in the Spark API, Spark is making the development effort of manipulating data and writing processing logic much more intuitive for developers. The elegance of the Spark API is a key part of the reason why Spark has grown in popularity.

One knock on Spark is that it is now being obsoleted by the next wave of compute fabric engines that are built from the ground up to be realtime streaming centric. Many claim that this streaming-first architecture is superior to Spark's batch-based architecture for both general-purpose processing and especially for streaming operations. Products such as Storm, Flink and Apex, just to mention a few, have garnered a lot of attention. The claim is that by using a streaming-first architecture, these engines can do both batch processing and streaming more efficiently than Spark does batch and micro-batch based streaming.
What is often left out of such debates is, again, the API. If you have ever tried to write a Storm processing stream you will know what I mean. So again, this is where Spark shines with its more intuitive APIs.

Now this is where we get to Spark's new API coming out in the soon-to-be-released Apache Spark 2.0. Spark will be introducing a new Structured Streaming API that will unify streaming, batch and Spark SQL. The Spark team is raising the productivity bar by unifying the way developers build both batch and streaming applications.

The idea is that a streaming application is really a "continuous application" and that the best way to build a streaming application is to not reason about streaming at all. In other words, Spark 2.0 with Structured Streaming will make building a streaming application no different than building any other Spark application. The streaming aspect is essentially declarative, and the Spark engine does the work of optimizing the stream. The big advantage this has for developers is that we can continue to think about our applications in the same way whether they are doing streaming or batch.
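
A minimal sketch of what that looks like (the canonical streaming word count over a socket source, written against the Spark 2.0 Java API as it stands pre-release, so details may still shift; the host and port are placeholders):

import java.util.Arrays;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.StreamingQuery;

public class ContinuousWordCount {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder()
            .appName("ContinuousWordCount")
            .getOrCreate();

        // Treat a socket as an unbounded table of text lines.
        Dataset<Row> lines = spark.readStream()
            .format("socket")
            .option("host", "localhost")
            .option("port", 9999)
            .load();

        // The same DataFrame/Dataset operations used in batch jobs express the logic.
        Dataset<Row> wordCounts = lines
            .as(Encoders.STRING())
            .flatMap((FlatMapFunction<String, String>) line ->
                     Arrays.asList(line.split(" ")).iterator(), Encoders.STRING())
            .groupBy("value")
            .count();

        // The engine incrementally maintains the aggregate as new data arrives.
        StreamingQuery query = wordCounts.writeStream()
            .outputMode("complete")
            .format("console")
            .start();

        query.awaitTermination();
    }
}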

Spark 2.0, with the advent of Structured Streaming, will leapfrog Spark ahead of the other competing streaming-first engines by removing the stream design complexity while at the same time bringing Spark's API elegance to the forefront.

At the end of the day, Spark's well-designed APIs will prove to be pivotal for developers. Developer productivity and Spark's fast-evolving optimized engine (Tungsten, etc.) will offer a hard-to-beat combination of productivity and raw scalable performance. A programming model that does not require a developer to reason about a stream, and instead lets them focus on the higher-order functions of their application, will in the end prove superior to the harder-to-use streaming-first engines such as Storm and the like. This unified programming model also frees the Spark engine to evolve the low-level streaming plumbing over time without impacting developers.

Wednesday, May 18, 2016

Fluent Interfaces a Beautiful Thing


Fluent programming interfaces, when done right, are an elegant thing to behold (for a programmer). They require no specialized learning versus what it would take to build and model the same sort of domain logic in an external DSL. While specialized DSLs have their place, they create a challenging ecosystem to support and impose the need for additional moving parts outside the core development of the application and system. When dedicated long-term resources are applied to supporting a DSL, there is no doubt external DSLs can be a powerful thing. But in their absence, fluent interfaces are a powerful software programming pattern.


Here is a good video presentation describing the pros and cons of fluent interfaces vs using external DSLs. The presentation provides a pragmatic perspective from a point of personal experience in the industry.

Like anything, fluent interfaces can be abused, but when used with good intentions they can create software that is easier to build, read and maintain. What are good examples of fluent interfaces? There are many, and I have noticed more and more frameworks and APIs supporting them. Cassandra's Java driver is one example (QueryBuilder), and frameworks like Apache Spark and other general map/reduce data flow processing APIs make great use of fluent interfaces.

Here is a snippet of code I borrowed from Martin Fowler's post on the subject that gives a before and after example of using a fluent API:

private void makeNormal(Customer customer) {
    Order o1 = new Order();
    customer.addOrder(o1);
    OrderLine line1 = new OrderLine(6, Product.find("TAL"));
    o1.addLine(line1);
    OrderLine line2 = new OrderLine(5, Product.find("HPK"));
    o1.addLine(line2);
    OrderLine line3 = new OrderLine(3, Product.find("LGV"));
    o1.addLine(line3);
    line2.setSkippable(true);
    o1.setRush(true);
}

private void makeFluent(Customer customer) {
    customer.newOrder()
            .with(6, "TAL")
            .with(5, "HPK").skippable()
            .with(3, "LGV")
            .priorityRush();
}
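
For a sense of what sits behind the fluent version, here is a minimal sketch of the builder it implies. The Customer, Order, OrderLine and Product classes are the hypothetical ones from Fowler's example; the key idea is simply that each method returns the builder so calls can be chained.

class OrderBuilder {
    private final Order order = new Order();
    private OrderLine lastLine;

    OrderBuilder(Customer customer) {
        customer.addOrder(order);
    }

    OrderBuilder with(int quantity, String productCode) {
        lastLine = new OrderLine(quantity, Product.find(productCode));
        order.addLine(lastLine);
        return this;                   // returning 'this' is what enables the chaining
    }

    OrderBuilder skippable() {
        lastLine.setSkippable(true);   // applies to the most recently added line
        return this;
    }

    OrderBuilder priorityRush() {
        order.setRush(true);
        return this;
    }
}

Customer.newOrder() would then just return new OrderBuilder(this).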

So, while fluent interfaces don't give you the power of a full-fledged external DSL, they can be a productivity boost to any API you are building. So give fluent interfaces a look in your next framework; they can make your code easier to build and maintain.




Saturday, April 23, 2016

Visualizing the Data Science Disciplines




Nice visualization showing how the various data science disciplines interrelate. It puts some of the hype around artificial intelligence, predictive analytics and big data in perspective.

Thursday, April 14, 2016

Understanding Supervised vs Unsupervised Machine Learning


I always found it a bit difficult to explain how labeled and non-labeled data sets factor into machine learning algorithms and the related training/modeling process. This short explanation I found on Stack Overflow helped crystallize it for me:

I have always found the distinction between unsupervised and supervised learning to be arbitrary and a little confusing. There is no real distinction between the two cases; instead there is a range of situations in which an algorithm can have more or less 'supervision'. The existence of semi-supervised learning is an obvious example of where the line is blurred.

I tend to think of supervision as giving feedback to the algorithm about what solutions should be preferred. For a traditional supervised setting, such as spam detection, you tell the algorithm "don't make any mistakes on the training set"; for a traditional unsupervised setting, such as clustering, you tell the algorithm "points that are close to each other should be in the same cluster". It just so happens that the first form of feedback is a lot more specific than the latter.

In short, when someone says 'supervised', think classification; when they say 'unsupervised', think clustering, and try not to worry too much about it beyond that.
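
As a small illustration of those two ends of the range, sketched against Spark's spark.ml API (the tiny in-memory data set is purely illustrative):

import java.util.Arrays;
import java.util.List;
import org.apache.spark.ml.classification.LogisticRegression;
import org.apache.spark.ml.clustering.KMeans;
import org.apache.spark.ml.linalg.VectorUDT;
import org.apache.spark.ml.linalg.Vectors;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public class SupervisionContrast {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("SupervisionContrast").getOrCreate();

        StructType schema = new StructType(new StructField[]{
            new StructField("label", DataTypes.DoubleType, false, Metadata.empty()),
            new StructField("features", new VectorUDT(), false, Metadata.empty())});
        List<Row> rows = Arrays.asList(
            RowFactory.create(1.0, Vectors.dense(0.0, 1.1)),   // e.g. spam
            RowFactory.create(0.0, Vectors.dense(2.0, 1.0)),   // e.g. not spam
            RowFactory.create(1.0, Vectors.dense(0.1, 1.2)),
            RowFactory.create(0.0, Vectors.dense(2.1, 0.9)));
        Dataset<Row> labeled = spark.createDataFrame(rows, schema);

        // Supervised: the label column is the feedback ("don't get these wrong").
        new LogisticRegression().fit(labeled);

        // Unsupervised: drop the labels; the only guidance is "nearby points go together".
        Dataset<Row> unlabeled = labeled.drop("label");
        new KMeans().setK(2).fit(unlabeled);
    }
}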

Hope you find it useful :)

Wednesday, April 6, 2016

The Industrial IoT and the Rise of Cloud Machine Learning


The Internet of Things (IoT) has been driven by advancements in many areas of technology along with the ever-expanding reach of the internet. This has made it feasible today for any device, big or small, to be connected to the world.

Alone, having billions of devices sharing information is more or less noise. The IoT is of little value unless businesses and industries can turn raw contextual data into valuable and actionable information. We are at a turning point across all industries where the volumes of data being generated have the potential to be turned into vital business information.

The Killer App
The IoT is founded on two principles: first and foremost is the ability to efficiently collect and catalog the vast sets of data available from sensors and other internal digital systems, and to handle this in a timely manner. Second, it is about creating machine learning models and the related analytics that can drive predictive and prescriptive decision making for the owners of these devices and industries. With the ability to connect any device, big or small, the opportunities for data gathering and for intelligent decision making based on large volumes of timely data will propel many industrial killer applications that turn data into value used to optimize business functions and operations. The opportunity for an industry or business to create its own IoT killer apps is just beginning - there are countless openings across all markets. Using data gathered from every corner of your business will drive huge opportunities for business optimization and efficiency.

Much of the initial interest in the IoT started around consumer and retail scenarios, such as tracking and monitoring consumers or optimizing product movement in supply chains. While this is all good, we are now moving beyond the consumer aspects, and the IoT is invading the world of the Industrial Internet, where the potential benefits can create tremendous business efficiencies that dwarf the opportunities in the consumer and retail operations space and can ultimately offer profound benefits to human advancement.

Get Your ETL Groove
Predicting failures and optimizing maintenance/operations have tremendous value across all industries from healthcare to aviation, but this is the end result of an overall process. Some of the less glamorous aspects that ultimately enable machine learning and predictive analytics lie in the challenge of collecting the wealth of data that the downstream machine learning and analytics are based on. Whether you are a wind turbine power plant or a railway operator, collecting data from field operations is no small task. You cannot get to useful machine learning models without gathering the data needed to train and feed your models. This is often an obstacle for industries not accustomed to collecting data with a Big Data mindset (velocity/volume/variety and context), but it is a critical first step to conquer.

Leveraging the Cloud
Here is where cloud services and machine learning PaaS solutions can help move industries from design to live deployments. Moving data into the cloud for cleansing and ETL processing is the first step in preparing your data sets for consumption by your data wranglers and data scientists. The good news is there are many startups popping up to help with this end-to-end process. Machine learning services are a new market for startups, and it is definitely worth looking into leveraging such services if you don't have the capacity to build the data wrangling, machine learning and predictive analytics yourself.

Leveraging the cloud is a great option for many businesses, and there are already many options to choose from. You can build your own from the ground up on an IaaS cloud environment or leverage the growing list of small and large PaaS cloud service providers coming online. One interesting trend is the focus on domain-specific machine learning and analytics for the IoT. Companies such as Predikto, for example, focus on predictive analytical services for targeted vertical industries (rail and aviation in this case). I think we will see an increase in startups focusing on abstracting away the technology complexity and plumbing and offering end users end-to-end services geared toward a particular market and industry. This focus on vertical domains also aligns well with how machine learning models need to be tuned and optimized and how each vertical industry tailors its own predictive and prescriptive decision making.

Take Some Action
As we move past the first phase of the IoT, which has been about tracking and monitoring the many connected devices, in the next phase industries will want to make actionable sense out of their data in ways that can improve business efficiencies and replace slow, reactive human decision making with real-time decision making based on machine learning models. The ultimate goal is to reach a point where the decision making is prescriptive and even potentially AI powered, but we are a long way from that. For many businesses, it is about wrangling their data and enabling the "Things" they are tapping into to feed data to the cloud in order to build and drive their machine learning models and downstream analytics.

The comparison I like to make is that the IoT is transforming business, across all industries, much the way digital trading platforms and high-frequency trading systems transformed the financial marketplaces. The financial trading platforms of today collect and monitor vast amounts of data points, and everyone is looking for the most timely and actionable information in order to beat the next guy. This is what is happening with the IoT, in large part. It is bringing to the surface vast amounts of data from every corner of a business and making it actionable. However, it will take time for many industries, from farming to manufacturing, to get their machine learning bearings. Again, don't look to build it all yourself. There are many cloud and big data resources, and even full-service startups specializing in your industry, that can help.

Updating Machine Learning Models
The process of making predictions and anticipating the future is based on building accurate models of your industrial world. These models are often not static. Models change over time as the many variables that impact your business change and as your business itself grows and evolves. So a common consideration in machine learning is how to keep evolving your models. A good post on this subject can be found here. It describes how time-series prediction (where historical data is vital to the model, as in weather modeling) and feedback data (data that arrives as a business grows, say a retail store starting to sell new products) impact machine learning models and can trigger models to be retrained from scratch. Understanding how your models evolve is an important aspect of machine learning, because without accurate models you have bad information. You are only as good as your models, and keeping them up to date is a constant effort by your data wranglers and data scientists.

It is also important to appreciate that observing your business can change it as well. So you must always be looking at retraining your models as you use predictions that optimize your business. The process of optimizing your business (for the better hopefully) requires changes and updates to your machine learning models on a regular basis.

IoT + ETL + ML Models + Cloud = Optimizing Business
There are many considerations when beginning your Industrial IoT journey. There are no short-cuts, and the effort requires investing in and developing new skills and leveraging new technologies, but the journey will profoundly change your business for the better.

Friday, April 1, 2016

Law of Parsimony Strikes Back

Let me first start off by saying (I hate it when people start off saying this - it usually means some principled BS is coming) that many new Big Data technologies, such as the concept of Map/Reduce, machine learning, and products such as Hadoop, Spark and NoSQL databases, are great tools to have in your IT arsenal. Also, don't forget other infrastructure technologies such as hardware virtualization, software containers and other micro-services deployment architectures that are making IT environments more flexible and more manageable (note, this does not mean simpler). There is no doubt these technologies fit a number of problem domains that in the past were very hard to address with standard computing stacks, IT tooling and relational database technology.

Now having said that, let's be careful not to over-apply them and end up with a system that is fragile and takes an army (or perhaps a small army) of super smart operations people to deploy and run. I always go back to one of my favorite principles, "the law of parsimony", sometimes referred to as Occam's razor. It boils down to the reality that nature has a habit of looking for the simplest path to solve a problem, though we sometimes fail to see the simple elegance. Take, for example, nature's design of the common leafy tree. While it has a complex structure, that structure comes from some very simple principles.

I feel we are at a point with technology where we are advancing at a great pace, but in the process we are creating a lot of complexity. As Occam's razor states, complexity is a relative consideration among the alternatives, so the bar is always moving up with what we consider to be complex, but we sometimes need to step back and not use a bulldozer when a shovel will do the job just as well and will not break down on us when we need it most.


I feel that way with a lot of technology I see being applied. There are so many options to choose from that sometimes the simplest option for the problem is overlooked. This might be human nature, "he who dies with the most toys wins", but this can be a costly mistake for many businesses.

I often joke that if you give me a bunch of plain old app servers (pick your favorite) and a relational database (pick your favorite), I can move the world (might need a load balancer in there somewhere ;). So, when you are in your next architecture planning meeting, ask yourself this question: can I bend my tools to my will, or do I need new toys to play with :)


Saturday, March 26, 2016

The Era of Deep Learning is Here


Have to agree with Google on this. Innovation in Machine Learning & Deep Learning combined with serverless cloud platforms will turn more and more data into actionable information by making these data science services and functions available to a wider audience.

This comment from the article is telling about where we are and where we have to go:
We also see data scientists complaining that they spend up to 80% of their time preparing the data and training the models before they can even begin to extract any value out of the current machine learning technologies. In fact, some data scientists sarcastically call themselves “data janitors,” because they spend more time preparing data than they do analyzing it.

It is currently fairly complex from an IT perspective to construct the infrastructure and services needed for training and building learned models and for ingesting data at the scale and velocity needed to turn raw data into value. This is also complicated by the challenge of finding the necessary skill sets to make it happen. The market and landscape, however, are evolving fast on all fronts, both on the IT side as described in the Google article and with more specialized IT skill sets becoming available. Stay tuned :)

Monday, February 29, 2016

Essbase Analytics with Tableau, Cassandra and Spark

Using Essbase? Looking to get some of that financial, accounting, sales and marketing data locked in your Essbase cube into something more accessible? Essbase is a very powerful platform, but it was built quite a while back (in tech time), when multi-dimensional modeling and DSL languages like MDX were a new frontier for data modeling and analytics.

Essbase was also built when requirements on analytics, reporting and visualization were much more constrained and the expectations for realtime were not as demanding as they are now (not to mention data volumes). There are many organizations using Essbase for critical business functions, so streamlining the path to quicker decision making and more robust what-if analysis is critical to being competitive and optimizing the operational performance of your business.

Oracle Essbase has a number of supporting tools for reporting and business intelligence that can give business analysts and developers access to visualizing and drilling down into the data within the cube. But with the evolution of Big Data and modern analytical and visualization tools, wouldn't you like the data locked up in your Essbase cube to be accessible to technologies such as Tableau for rich and rapid visualization? Wouldn't you like your terabytes of cube data in Cassandra and available to Apache Spark for big data style ETL, machine learning, and mashing and correlating with other data sources?

Well, while there is no easy out of the box solution to accomplish all of this, the dream to turn your Essbase cube into another data lake that is part of your Big Data ocean and more available for rich analytics and predictive modeling and visualization is achievable with a little work.

Let's describe how you can do this. The first step, and probably the most difficult, is getting your data out of Essbase. There are a number of ways to access data in Essbase, and it starts with understanding what "information" you want to extract. You typically don't want to directly extract the raw data that is in the Essbase cube (though you could do that as well). Such data is often too granular (one of the reasons it is in a cube), so you might need to perform some aggregations (across the dimensions) and apply some business logic as you extract it. This is an ETL step that more or less denormalizes the data out of the cube, flattens it into a format that will be ideal for Tableau (further downstream in the process), and applies the necessary business logic to get it into consumable information form. Tableau is ideal at consuming such "flattened" information, given how it extracts dimensionality out of denormalized input.

What is typically stored in Essbase dimensions and cells is detailed data elements (financial, sales, etc.) that might need some business transformation applied before extraction out of the cube. So this ETL process prepares the data for ultimate consumption by Tableau. This is part of the art of the design, and where you must understand what class of information you are after from the raw source data in the cube. It is part of the modeling exercise you go through, and it is critical to get it correct in order for the data to be in a structure that can be visualized by Tableau.

Now for the actual mechanics of extracting data from Essbase, you have a few options. Essbase provides a few ways to get data out of the cube.



There are two main options for extracting data from Essbase. Smart View is one option that leverages a spreadsheet approach for extracting, transforming and flattening data out of the cube in preparation for channeling it further downstream. While Smart View is not a pure programmatic API, the Excel spreadsheet capabilities allow for a lot of ad-hoc exploration and proof-of-concept work in getting data out of the cube, and it should not be underestimated what can be done with Smart View and the supported Essbase APIs available through Excel.

The second option is using the Essbase Java API. Using the Java API allows for directly querying the Essbase database and gives very dynamic and flexible access to the cube. This can be the most robust way to get at data in the cube, but it is also the most development-intensive.

One thing to note is that Smart View and the Java API are not mutually exclusive. Behind the scenes Smart View uses the Java API and functions as a middleman service that allows Excel to interface with Essbase. There is a Smart View server which exposes web services accessed by Smart View. The Smart View server (aka Analytics Provider Services, or APS for short) then uses the Essbase Java API to talk natively to the Essbase server.

The main goal of this step (whether using Smart View or the Java API) is to extract the cube data that we ultimately want to see in Tableau.

The next step is storing the extracted data described in the first step. The goal here is to store the flattened data in Cassandra tables. This requires a custom loader app to take the flattened data and load it into Cassandra. What is critical to consider in the design up front is whether the load process will be purge-and-reload, time-series DW loading (fast-changing dimensional data) or change-data DW loading (slow-changing dimensional data).



Storing the data in Cassandra sets us up for the final stage of the process, which is creating the Tableau data extract that delivers the final data processing stage. Note that in setting up the data for loading into Cassandra, Spark can be used to aid in the ETL process. One often overlooked feature of Apache Spark is that it is an excellent ETL tool. In fact, oftentimes Spark deployment efforts end up performing quite a bit of ETL logic in order to prepare data for the final stage of modeling and machine learning processing. Apache Spark is a great tool for the emerging field of realtime ETL that is powering the next evolution of data movement in Big Data environments.
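
A rough sketch of that load step, written against the Spark 2.x DataFrame API and the DataStax spark-cassandra-connector (the CSV landing path, keyspace, table and column names are stand-ins, and the target Cassandra table is assumed to already exist):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class EssbaseExtractLoader {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("EssbaseExtractLoader")
            .config("spark.cassandra.connection.host", "127.0.0.1") // Cassandra contact point
            .getOrCreate();

        // Read the flattened, denormalized extract produced from the cube.
        Dataset<Row> flattened = spark.read()
            .option("header", "true")
            .option("inferSchema", "true")
            .csv("/data/essbase/flattened_extract/*.csv");

        // Light ETL: keep only the columns the downstream model needs.
        Dataset<Row> curated = flattened
            .select("fiscal_period", "entity", "account", "amount");

        // Write into the Cassandra table via the spark-cassandra-connector.
        curated.write()
            .format("org.apache.spark.sql.cassandra")
            .option("keyspace", "finance_dw")
            .option("table", "essbase_facts")
            .mode("append")
            .save();
    }
}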

The next step in the process is using the Cassandra structured data in an environment where the Cassandra tables can be made visible to Tableau for realtime extraction and modeling. This is where Apache Spark comes into the picture. Normally, if you set up Cassandra as a direct data source for Tableau, you will have processing limitations, as Cassandra can't perform the joins and aggregations needed by Tableau; with Cassandra alone, the Tableau analytics will occur on the Tableau client side. However, with Spark in the picture this processing can happen within the Spark cluster.

Here is a final picture of the major components in the workflow and processing flow:



While there are some pitfalls to be wary of, that is the case in any Big Data build-out. And using products like Essbase and Tableau doesn't make the build-out any easier. It would be nice to have fewer moving parts, but with a sound deployment and infrastructure this architecture can be made to scale out and is viable for supporting smaller-footprint deployments as well.

Here are a couple of useful links that describe in more detail how the Spark, Cassandra and Tableau integration works:

With this architecture you get the scalability of Spark and Cassandra for both data processing and storage scale-out. In addition, with this approach you avoid a common requirement with Tableau to create TDEs (Tableau Data Extracts) that are cached/stored on Tableau Server, because oftentimes source systems such as Essbase and even traditional RDBMS environments don't scale to support Tableau Server/Desktop needs for realtime aggregations and transformations. Apache Spark steps in to provide the Big Data computational backbone needed to drive the Tableau realtime visualizations and modeling. While Tableau Server is great at serving the Tableau web UI and helping with some of the data governance (note this is an area Tableau is improving in), Tableau's server-side storage and processing capabilities are somewhat limiting.

To sum things up, Essbase cubes and related reporting services are not very scalable and accessible beasts, so this is where the combination of Cassandra and Spark can help out and give Tableau a better compute backbone that can drive interactive data visualization of your Essbase cube. Hopefully this information will inspire you to look at using Tableau with Essbase and help you ultimately unlock the potential of your Essbase data!

Tableau and Essbase can be a great combination for building rich reporting and dashboards without the overhead and complexity of traditional data warehousing and BI. Get your financial data out of Essbase and into Tableau, and into the hands of your executives and decision makers. Contact Grand Logic to learn more.

Tuesday, February 2, 2016

Spark Processing for Low Latency Interactive Applications

Apache Spark is typically thought of as a replacement for Hadoop MapReduce for batch job processing. While it is true that Spark is often used for efficient large-scale distributed cluster processing of compute-intensive jobs, it can also be used for low-latency operations in more interactive applications.

Note this is different than Spark Streaming and micro-batching. What we are talking about here is using Spark's traditional batch, memory-centric MapReduce functionality and powerful Scala (or Java/Python/R) APIs for low-latency, short-duration interactive processing via REST APIs integrated directly into application code.

The Spark processing API is very powerful and expressive for doing rich processing and the Spark compute engine is efficient at optimizing data processing and access to memory and workers/executors. Leveraging this in your interactive CRUD applications can be a boon for application developers. Spark makes this possible with a number of capabilities available to developers once you have tuned your Spark cluster for this type of computing scenario.

First, latency can be reduced by caching Spark contexts and even caching (when appropriate) RDDs. The Job Server open source project is a Spark-related project that allows you to manage a pool of Spark contexts, essentially creating cached connections to a running Spark cluster. By leveraging Job Server's cached Spark contexts and REST API, application developers can access Spark with lower latency and enable multi-user shared resources and processing on the Spark cluster. Another interesting project that can be useful for interactive applications is Apache Toree - check it out as well.
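
Tying this together, here is a rough sketch of what submitting work to such a shared context could look like from application code. The endpoint shape and query parameters follow spark-jobserver's documented conventions as best I recall and should be verified against its docs; the host, port, app name, job class and context name are all placeholders.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class SparkJobClient {
    public static void main(String[] args) throws Exception {
        // Submit a job synchronously against a pre-created (cached) Spark context.
        URL url = new URL("http://jobserver-host:8090/jobs"
            + "?appName=analytics-jobs"
            + "&classPath=com.example.jobs.TopCustomersJob"
            + "&context=shared-low-latency"
            + "&sync=true");

        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);

        // Job input is passed as config text in the request body.
        String jobInput = "input.region = \"midwest\"\n";
        try (OutputStream out = conn.getOutputStream()) {
            out.write(jobInput.getBytes(StandardCharsets.UTF_8));
        }

        // Read the JSON result the job returned.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}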

Secondly, you can set up a standalone Spark cluster adjacent to your traditional application server cluster (a Tomcat servlet engine cluster, for example) that is optimized for handling concurrent application requests. Spark has a number of configuration options that allow a Spark cluster to be tuned for concurrent, short-duration job processing. This can be done by sharing Spark contexts as described, by using the Spark fair scheduler, and by tuning RDD partition sizing for the given set of worker executors to keep partition shuffling to a minimum. You can learn more from this video presentation on optimizing Job Server for low-latency and shared concurrent processing.

Leveraging and tuning a multi-user friendly Spark cluster frees application developers to use Spark's powerful Scala, Java, Python and R APIs in ways not available to traditional application developers in the past. With this capability you can enhance traditional CRUD application development with low-latency MapReduce-style functionality to create applications not imaginable before.


With this type of architecture, where your traditional application servers use an interactive, low-latency Spark cluster via a REST API, you can integrate a variety of data sources and data/analytics services together using Spark. You can, for example, mash up data from your relational database and Cassandra or MongoDB to create processing and data mashups you could not easily build with hand-written application code. This approach opens up a bountiful world of powerful Spark APIs to application developers. Keep in mind, of course, that if your Spark operations require execution on a large set of workers/nodes and RDD partitions, this will likely not lead to very good response times. But any operation with a reasonable number of stages that can be configured to process on one or a few RDD partitions has the potential to fit this scenario; again, something for you as the developer to quantify.

Running a Spark cluster tuned for servicing interactive CRUD applications is achievable and is one of the next frontiers that Spark is opening up for application developers. This will open the door to data integrations and no-ETL computing that were not feasible or imaginable in the past. Meshing data from multiple data stores and leveraging Spark's powerful processing APIs is now accessible to application developers and no longer the realm of backend batch processing developers. Get started today. Stand up a Spark cluster, tune it for low-latency processing, set up Job Server and then create some amazing interactive services!


Monday, February 1, 2016

Temporal Database Design with NoSQL


Managing data as a function of time in a database is a common requirement for many applications and data warehousing systems. Knowing when a data element or group of elements has changed, and over what period of time the data is valid, is often a required feature in many applications and analytical systems.

While not easy compared to traditional CRUD database development, supporting this type of bitemporal management functionality with a traditional RDBMS such as MySQL or Oracle is fairly well understood by data modelers and database designers. Such temporal data modeling can be done in a variety of ways in a relational database for both OLTP and OLAP style applications. For example, Oracle and IBM DB2 have built-in extensions for managing bitemporal dimensionality at the table and schema level. It is also possible to roll your own solution with any of the major RDBMS engines by applying time dimension columns (very carefully) to your schema and then managing the updating and insertion of new change records with the appropriate DML and transactions. To do this precisely and 100% consistently, the database is required to support durable ACID transactions, something all RDBMSs have in spades. See the Wikipedia links for background on temporal database models.

Now this is all great: temporal and bitemporal table/schema design is a concept understood by data architects in the RDBMS world. But how do you do this if you are on the Big Data and NoSQL bandwagon? To begin with, most NoSQL databases lack support for ACID transactions, which are a prerequisite for handling temporal operations on slowly changing dimensions (temporal data) and bitemporal dimensions (the valid time and transaction time dimensions). ACID transactions are required in order to properly mark expired records as new records are appended. Records must never overlap and must be properly and precisely expired as new valid time and transaction time record slices are added to the database.

NoSQL databases such as Cassandra and Couchbase are powerful database engines that can be leveraged for a wide segment of data processing and storage needs. NoSQL databases offer many benefits, including built-in distributed storage/processing, flexible schema modeling and efficient sparse data management. Many of these benefits come at a price, though, and that price limits NoSQL database applicability in cases where durable ACID transactions are required, such as managing multi-row, multi-table transactions for both OLTP and OLAP data processing.

To address this limitation, a NoSQL database such as Couchbase or Cassandra, for example, can be paired with an ACID database in such a way (the pairing is both operational and at the schema design level) as to allow using the NoSQL database for what it is best at while supporting bitemporal operations via the RDBMS. Under the hood this is done seamlessly by having a data serialization and deserialization API that synchronizes and coordinates DML operations between the RDBMS and the NoSQL database. The schema design provides a polyglot database framework that supports temporal and bitemporal data modeling and a data access and query API that supports durable bitemporal operations, while retaining the flexibility and advantages of NoSQL data modeling (document, key/value, etc.).
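
A minimal sketch of what that pairing could look like in application code. Every class and table name here is hypothetical; the point is only the coordination: the document body goes to the NoSQL store under a version id, while the bitemporal index rows that reference it are maintained inside a single RDBMS transaction.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.Timestamp;
import java.time.Instant;
import java.util.UUID;
import javax.sql.DataSource;

// Hypothetical abstraction over the NoSQL store (Couchbase, Cassandra, ...).
interface DocumentStore {
    void put(String versionId, String jsonDocument);
}

public class BitemporalDocumentDao {
    private final DataSource rdbms;
    private final DocumentStore documents;

    public BitemporalDocumentDao(DataSource rdbms, DocumentStore documents) {
        this.rdbms = rdbms;
        this.documents = documents;
    }

    // Supersede the current epoch of a business key with a new document version.
    public void save(String businessKey, String jsonDocument, Instant validFrom) throws Exception {
        String versionId = UUID.randomUUID().toString();
        Timestamp now = Timestamp.from(Instant.now());

        // Write the document body first; the RDBMS commit below is what makes it visible.
        documents.put(versionId, jsonDocument);

        try (Connection con = rdbms.getConnection()) {
            con.setAutoCommit(false);
            // Close the open transaction-time epoch for this key.
            try (PreparedStatement close = con.prepareStatement(
                    "UPDATE doc_version SET tt_end = ? WHERE business_key = ? AND tt_end IS NULL")) {
                close.setTimestamp(1, now);
                close.setString(2, businessKey);
                close.executeUpdate();
            }
            // Insert the new epoch pointing at the NoSQL document version.
            try (PreparedStatement insert = con.prepareStatement(
                    "INSERT INTO doc_version (business_key, version_id, tt_start, tt_end, vt_start, vt_end) " +
                    "VALUES (?, ?, ?, NULL, ?, NULL)")) {
                insert.setString(1, businessKey);
                insert.setString(2, versionId);
                insert.setTimestamp(3, now);
                insert.setTimestamp(4, Timestamp.from(validFrom));
                insert.executeUpdate();
            }
            con.commit();
        }
    }
}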

This approach can be applied to NoSQL databases in OLTP, data warehousing and Big Data environments. So leverage your favorite NoSQL database and get the best of both worlds! Get your polyglot engines going; your favorite NoSQL database just got bitemporal! Contact Grand Logic to learn how we can help you build your next bitemporal database environment.


Monday, January 11, 2016

Big Data Warehouse with Cassandra & Spark


Enterprise data warehousing (EDW) has traditionally been the realm of big iron databases such as Oracle and IBM and other vertical storage engines such as Teradata. With the rapid evolution of Big Data in the past few years, the market has begun to shift away from monolithic and highly structured data storage engines that lack inherent support for the tenets of Big Data.

While data warehousing (DW) design has traditionally implied denormalization and a focus on data structures that are more in tune with the applications using them (sounds a bit like the NoSQL philosophy, doesn't it), many of the Big Data storage options and NoSQL databases lack some of the functionality (at least out of the box) needed for the ad-hoc querying capabilities and analytics required to support a data warehousing solution.

Enter into the picture Cassandra and Spark. These are two products that together can allow you to build your own robust and flexible data warehousing and analytics solution, all while running on top of a Big Data centric compute and storage grid environment. Together, Cassandra and Spark complement each other to allow for flexible data storage and rich query, analytics and compute processing.

Cassandra is widely known in the industry for its modular scaling, built-in partitioning and replication. Cassandra's query interface (CQL) has some of the benefits of SQL while allowing for the benefits of NoSQL semi-structured data, wide-column scaling and sparse row capabilities. But with many of Cassandra's powerful NoSQL features come inherent limitations, such as the inability to perform aggregation operations and rich analytics functions within Cassandra. And as with all NoSQL (non-relational) storage engines, joining tables is not something offered by Cassandra. These are significant gaps when building a data warehouse.


This is where Spark, and Spark's integration with Cassandra, fills the feature gap and delivers the capabilities necessary for a fully capable data warehousing platform. Spark's data management capabilities via RDDs (Resilient Distributed Datasets) and Spark's powerful distributed compute fabric combine to provide the ability to build a robust and highly scalable storage and analytics data warehousing solution.
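
For a feel of the kind of warehouse-style query this combination enables, here is a rough sketch against the Spark 2.x DataFrame API with the spark-cassandra-connector; the sales_dw keyspace, its tables and their columns are made up for illustration:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.sum;

public class WarehouseQuery {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("WarehouseQuery")
            .config("spark.cassandra.connection.host", "127.0.0.1")
            .getOrCreate();

        // Load a Cassandra fact table and a dimension table as DataFrames.
        Dataset<Row> sales = spark.read()
            .format("org.apache.spark.sql.cassandra")
            .option("keyspace", "sales_dw").option("table", "sales_facts")
            .load();
        Dataset<Row> products = spark.read()
            .format("org.apache.spark.sql.cassandra")
            .option("keyspace", "sales_dw").option("table", "product_dim")
            .load();

        // The join and aggregation Cassandra cannot do on its own happen in Spark.
        Dataset<Row> revenueByCategory = sales
            .join(products, sales.col("product_id").equalTo(products.col("product_id")))
            .groupBy(products.col("category"))
            .agg(sum(col("amount")).alias("revenue"));

        revenueByCategory.show();
    }
}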

One of the big benefits of building your DW solution on Cassandra and Spark is that you get all the benefits of Big Data scaling (compute and storage) while running on commodity hardware and leveraging Spark's elegant programming interfaces (Scala, Java, Python, R). And with Spark you have room to build machine learning and other deep analytics on your data, without the lock-in and limitations of legacy big-iron data warehousing engines.

Roll up your sleeves and start your own journey to building your next Big Data warehouse using Spark and Cassandra.

Wednesday, December 2, 2015

No Compromise Database with NoSQL & Apache Spark


Database technology has been going through a renaissance over the past several years. Relational databases have matured steadily over the past couple of decades; however, they were not well equipped to deal with the data volume, velocity and variety (the three Vs) now demanded by the world of social apps, mobile, IoT and Big Data, just to name a few.

We are now seeing many new database engines coming to market (commercial and open source) geared to servicing particular application domains and functional verticals. There is some awesome innovation happening, but the common theme you see with the vast majority of these databases is that they give up something from the traditional relational database world to hit, for example, the CAP theorem sweet spot they are aiming for or the volume/scalability/throughput they are trying to achieve.

The most common tradeoff made by many of the NoSQL database engines, for example, is the elimination of table or entity joining. Joining data sets is a fundamental part of the relational model that allows for modeling data using a normalization approach and having a schema that can serve multiple application scenarios. This approach is different with NoSQL databases. When designing a NoSQL database schema, the modeling of the schema/data (or the lack of schema, or a less rigid schema) is very tightly coupled with how the applications will use it. So NoSQL databases trade away the strong typing of the relational world and push more complexity to the application tier.


The fact that joining is missing from many of the popular NoSQL engines (Cassandra, MongoDB...) puts more complexity on the application tier to provide functionality such as combining and mashing different data sources together. For example, trying to do a join between two data sets pulled from two different tables or storage engines can be complex and hard to scale in the application tier. Enter Apache Spark into the picture. With Spark, application developers can use Spark's grid computing capabilities to perform database-engine-type operations without reinventing the wheel in the application layer, while at the same time leveraging a highly scalable compute and memory management grid with built-in rich data transformation operations (RDDs, map/reduce, filters, joins...).
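
A rough sketch of that kind of cross-store mashup, again using the Spark DataFrame API (the JDBC connection details, the Cassandra keyspace/table and the column names are all placeholders; the spark-cassandra-connector and a JDBC driver are assumed to be on the classpath):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class PolyglotJoin {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("PolyglotJoin")
            .config("spark.cassandra.connection.host", "127.0.0.1")
            .getOrCreate();

        // Customers live in the relational database, loaded over JDBC.
        Dataset<Row> customers = spark.read()
            .format("jdbc")
            .option("url", "jdbc:postgresql://db-host:5432/crm")
            .option("dbtable", "customers")
            .option("user", "app").option("password", "secret")
            .load();

        // Click events live in Cassandra.
        Dataset<Row> events = spark.read()
            .format("org.apache.spark.sql.cassandra")
            .option("keyspace", "tracking").option("table", "click_events")
            .load();

        // The join neither store can do by itself happens in Spark's compute grid.
        Dataset<Row> enriched = events.join(
            customers, events.col("customer_id").equalTo(customers.col("id")));

        enriched.show(20);
    }
}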

Combining Apache Spark with your backend application services is a powerful way to scale NoSQL databases by allowing for rich data operations across multiple tables, documents and polyglot data sources. And this can be done while leveraging Spark's very rich and expressive APIs and highly scalable processing and memory caching.

So Spark is not just for petabyte-scale Big Data number crunching and machine learning tasks. You can use Spark in your traditional data management tier to join disparate data entities and use it for rich data processing operations typically provided by relational databases. With Spark you get the benefits of NoSQL without compromise.

Embed Spark into your backend application tier and give Apache Spark a spin; it will change how you build backend services forever.

Wednesday, November 18, 2015

Understanding Apache Spark - Why it Matters


Apache Spark has come on the scene in the past few years and has taken the computing world by storm. It is dubbed as the replacement for Hadoop and often seen as the next evolution in Big Data. Spark is one of the most active Apache projects and has developed a strong ecosystem. Even the Big Data players themselves are adopting it in their stack and positioning it as a key player in their overall open source and productized solutions.

Why has Spark been so successful? How is it better or different than the first incarnation of Big Data (aka Hadoop)? Well, Spark does not abandon the principles realized by Hadoop and the companies that helped bring the Big Data philosophy to the masses. Spark builds on the basic building blocks of those technologies, such as HDFS and programming constructs like Map-Reduce, and it does so in a way that makes building applications on top of Spark much more efficient and effective than on its predecessors.

Spark, like Hadoop, supports building a computing fabric that can be deployed and run on commodity hardware and that inherently supports horizontal scaling. Spark lowers the barriers to parallelizing applications and spreading the computing and data access across a cluster of computers for processing. Hadoop does many of the same things, but Spark does them better, both from a technology implementation perspective (more efficient use of memory, garbage collection handling, etc.) and with a much better application programming API.



What Spark does is raise the bar from a programming interface perspective. It has strong support for Java, Scala, Python and R. Its core operations for managing data (such as RDDs) and computing are very well designed interfaces and APIs. When working with Spark you still have to look at your application and the problem you are trying to solve and think about how to parallelize it, but the Spark APIs are intuitive to understand and use for the typical application programmer. Spark gives you tools to access essentially the same power a grid computing platform or a distributed database engine might have internally, and makes it available to the average programmer to embed that same sophistication in their own application.
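
To make the point concrete, here is the classic word count sketched with Spark's Java RDD API (written against the Spark 2.x Java API; the input and output paths are placeholders):

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class WordCount {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("WordCount");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaRDD<String> lines = sc.textFile("hdfs:///data/input.txt");

            // Each transformation reads like the operation it performs.
            JavaPairRDD<String, Integer> counts = lines
                .flatMap(line -> Arrays.asList(line.split(" ")).iterator())
                .mapToPair(word -> new Tuple2<>(word, 1))
                .reduceByKey((a, b) -> a + b);

            counts.saveAsTextFile("hdfs:///data/word-counts");
        }
    }
}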

Spark is a game changer. It can be used for everything from ETL to basic application OLTP computations that drive a GUI, to backend batch processing, to real-time streaming applications and graph modeling. Spark is truly a game changer that will bring some of the powerful distributed computing technology pioneered by the internet giants into applications at all levels of the enterprise. Strap on your boots and start learning Spark. It is the next evolution not just in Big Data but in general-purpose application programming, bringing true distributed grid computing to the programming masses.


Monday, July 27, 2015

Unbundling Database Architecture: Turning Databases Inside-Out

Relational database technology has been around for a few decades now. In the last several years we have seen a resurgence of innovation around data storage and data processing. This has pushed us into the realm of thinking outside of traditional SQL and big iron monolithic computing.

NoSQL, NewSQL and distributed commodity/cloud storage are changing how we build persistence into our applications. However, the fundamentals of databases have not changed much. Lower-cost memory and the availability of cheaper cloud computing have created a lot of innovation, but how databases function under the hood has not changed very much.

The fundamentals of how transaction atomicity, replication and considerations such as the CAP theorem are handled are still tackled in much the same way as they were in earlier database engines. But is there a different way to look at how applications manage persistence for OLTP types of transactions? Well, Apache Samza presents an interesting approach to how data is managed. While it takes a streaming-centric approach, this could present a new way for applications to manage general data storage in the future.

Here is an interesting blog post that presents a breakdown of the Apache Samza architecture and how it can facilitate more general-purpose application data management by using an "unbundled" architecture at the heart of the database engine. Is this just another specialized data storage engine geared toward streaming data and analytics, or a whole new way to think about database architecture?

Sunday, June 7, 2015

Isomorphic Web Apps: Back to the Future, Again


As web application development evolves, we continue to see the pendulum swing between client and server. Over the past two decades we have moved from simple multi-page HTML applications that are rendered exclusively on the server to ultra fat single page applications (SPA) containing more javascript than anyone would have imagined a few years ago.

Over the past couple of years, many large hosted sites (e.g. Airbnb, Facebook and others) have run into challenges with building heavy javascript client apps and have rediscovered the value of rendering some of the web content on the server. Technology such as Node.js has made this easier, and so has the creation of frameworks such as ReactJS. This rediscovery of using the server for rendering UI now has a new cool name, Isomorphic JavaScript. The name seems to have stuck, so we will need to add it to our lexicon :)

The technology around this approach is gaining steam of late. Here is a good blog from Airbnb on what led them to consider this architecture for their hosted web application services. While the idea of moving away from pure SPA has been around for a while, we will for sure start to see more of the established front-end JavaScript frameworks incorporating it in one way or another, as well as newer frameworks such as ReactJS.

ReactJS is one of the more popular frameworks that leverages server-side rendering and advocates this hybrid approach to web application development. While Node.js is the leading container for supporting this application delivery model, we will start to see JVM support and integration as well via Java 8's Nashorn engine.

There are many benefits to building your web application with an isomorphic javascript architecture that I will try to cover in an upcoming blog; there are already some good blogs covering the subject. Also expect AngularJS 2.0 to offer support for server-side rendering, but we will have to wait and see what Google comes up with as AngularJS 2.0 gets further along.

So keep an eye out for this new twist in web application development. It will be a boost for mobile development as well, since mobile can certainly benefit from some server-side offloading of processing. But like most things, this new technology approach is no free lunch. Isomorphic javascript does add some complexity to constructing your web applications. Some of this may be alleviated as web application frameworks evolve and as the HTML Web Components standards mature. Stay tuned.

Saturday, May 9, 2015

A Future Written in TypeScript?


Web developers! Get your TypeScript engines started. Sad to say that Dart is dead, but TypeScript is a much more natural evolution toward ECMAScript 6 and a more team-scalable, structured and manageable extension to JavaScript programming (long live static typing :) that will help bring web development out of the wild wild west.



Here is how AngularJS 2.0 is influencing the future of web development:
https://blog.mariusschulz.com/2015/03/06/angular-2-and-typescript

Thursday, November 13, 2014

Web Components are Real

Web Components are not another internet buzzword. Web Components are a collection of web browser constructs and standards that will modernize client-side web development and improve the web design process overall. This has been a long time in the making, but these are the missing building blocks (along with continued ECMAScript maturity) needed to bring web development on par with traditional structured programming languages and environments, without the need for the crazy hacks we have today.



The key standards behind Web Components include:
  • Shadow DOM: Finally DOM trees that don't step on each other. Modular DOM structures can exist and interact with each other.
  • Custom HTML Elements: HTML building blocks where each custom element can have encapsulated properties, functions and events. Elements can exist in a hierarchy/nesting and look and act like native HTML elements.
  • HTML Imports: Import HTML pages and source files like other programming languages.
  • CSS Grid Layout: Table and grid layout done in a more intuitive way and more akin to how most client GUI frameworks handle widget layout.

These standards will impact low level frameworks such as jQuery, but will also change the way higher order client side frameworks like AngularJS, GWT, Ember, Knockout evolve over time and how they provide wiring, plugin and extension capability to their developers.

So get ready for Web Components. They are real and will finally bring modular, structured programming to the web, supporting more robust, scalable, maintainable and extensible client-side development frameworks.

P.S. Keep an eye on the Polymer project if you want to experiment with Web Components today. This client-side framework packages many of the emerging standards into a developer-friendly API and programming model. But keep in mind that Polymer is not Web Components; it is just a project that demonstrates the power of these new Web Component standards.

Sunday, September 14, 2014

JobServer Release 3.6.14

We are happy to announce the release of JobServer 3.6.14, which introduces LDAP support and improved shell script processing, allowing you to turn any standalone program or shell script into an easy-to-automate, easy-to-track application. Yes, with JobServer you can give your shell scripts and standalone batch programs a GUI front-end, customize how they are invoked, and leverage powerful reporting and monitoring to easily track all input and output related to them.

With this release, JobServer now supports improved tracking of shell script output via the JobServer JobTracker reporting and tracking application. You can now preview the standard output of every shell script right from the top level JobTracker search report. You can also now run shell script jobs manually and pass custom input parameters to the shell scripts. Using JobServer with batch scripts just got a whole lot more fun and productive.

Want to simplify user authentication for you and your JobServer end users? With LDAP support, you can now integrate JobServer with Active Directory or any LDAP-compatible environment for more seamless user authentication.

Download and test drive JobServer 3.6.14 today and learn more about JobServer's powerful developer SDK, soafaces, which makes it easier to extend and customize JobServer and to develop custom jobs and backend automated services.

Grand Logic delivers software solutions that automate your business processes and tame your IT operations & Big Data analytics. Grand Logic delivers data and job automation software, Hadoop and predictive analytics consulting services that maximize your Big Data investment.

Saturday, August 16, 2014

Tableau for Agile Oracle Essbase Financial Reporting

Oracle Hyperion Essbase is an established multidimensional database platform often used by accounting departments to model and store their company's financial data. Essbase comes with out-of-the-box Oracle web reporting utilities to help you visualize your financials for management, and it also integrates with tools such as MS Office for reporting via Excel. The typical model is that you end up passing lots and lots of Excel spreadsheets around your organization and with your executives - a bit antiquated in this day and age, to say the least. You will, however, find that Excel, to its credit, is commonly used to build fairly advanced reports against Essbase using home-grown Excel and VB programming. But there has to be a better way, one that does not involve building the complex data warehousing that typical third-party BI tools require.

Tableau is a fast growing and popular visualization BI solution that enables business analysts (without advanced technical expertise) to perform data discovery and build rich, sophisticated visualizations that can be shared far more easily than traditional Excel spreadsheets. Tableau has emerged as a powerful replacement for Excel-based reporting and a challenger to established enterprise BI platforms such as MicroStrategy and Cognos, to mention a few. It fits well as an agile replacement for Excel reporting while allowing users to build very powerful, next generation reports and dashboards that outperform the traditional enterprise BI vendors.

Tableau still has a way to go on the enterprise end, but it is coming on strong, and if you know how to deploy and implement Tableau Server you can build highly agile, visually rich, enterprise-grade BI solutions. For financial reporting, Tableau allows you to take your legacy Essbase reports and spreadsheets out of the dungeon and into the light of day by building sophisticated dashboards that are easily accessible to all your executives across your organization via Tableau Server.

With Tableau you can just say no to building yet another data warehouse and complex ETL when architecting your business intelligence strategy. But be aware: Tableau can extract data from Essbase directly using the built-in Tableau-to-Essbase connector, but say no to this as well; it will not work well (that needs another blog). We strongly suggest not using the Tableau Essbase cube connector for a number of reasons (not all of them Tableau related); this connector has many challenges. A hint: extract your Essbase data using the Essbase Excel plugin, mix in a little ETL, and output denormalized flat data structures. Say what? Yes, this approach rocks! Remember that Tableau is great at extracting dimensionality out of your data (that is one of its claims to fame, actually).
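To give a flavor of what that flatten-and-denormalize step can look like, here is a hypothetical Python/pandas sketch (the file name, sheet name and column names are made up for illustration; the raw extract comes from the Essbase Excel add-in):

# Hypothetical sketch: flatten an Essbase cross-tab retrieve into a tall,
# denormalized extract that Tableau can consume directly.
import pandas as pd

# Raw Essbase retrieve, typically a cross-tab with periods spread across columns.
raw = pd.read_excel("essbase_retrieve.xlsx", sheet_name="PnL")

# Melt the cross-tab into one row per (Entity, Account, Period) with a single
# numeric Amount column, the flat shape Tableau works best with.
flat = raw.melt(
    id_vars=["Entity", "Account"],  # assumed dimension columns
    var_name="Period",
    value_name="Amount",
).dropna(subset=["Amount"])

# Write a denormalized extract for Tableau.
flat.to_csv("financials_flat.csv", index=False)

From a flat file like this, Tableau can rediscover the dimensionality on its own, and you skip the data warehouse entirely.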

At Grand Logic, we have developed an elegant and straightforward approach to extracting data from Oracle Essbase for agile and efficient consumption by Tableau. This in turn can be used to build advanced financial reports and dashboards without a huge investment in data warehousing and ETL processing. Our approach to integrating Tableau with Oracle Essbase leads to a powerful solution that will leave your executives wanting more and frees your accountants and financial analysts from building cumbersome-to-maintain Excel reports. Get your financial reporting and dashboarding into Tableau today for centralized access and one governed version of the truth, and put actionable, insightful data in the hands of your executives.

Are you also looking to invest in Big Data infrastructure and analytics? Essbase does not have to be an isolated island of data divorced from your Big Data initiatives. Read more on how you can integrate Essbase data with your Big Data analytics.

Looking to get your Essbase cube into a Big Data lake? Learn more about how you can integrate Essbase with Tableau and Apache Spark to supercharge your Tableau and Essbase connectivity.

Tableau and Essbase can be a great combination for building rich reporting and dashboards without the overhead and complexity of traditional data warehousing and BI. Get your financial data out of Essbase and into Tableau, and into the hands of your executives and decision makers. Contact Grand Logic to learn more.

Wednesday, February 5, 2014

Machine Learning: The Brains Behind Big Data

The first round of the data revolution has focused on commoditizing computing and storage. Platforms such as Hadoop and NoSQL have helped propel this and have enabled businesses to economically deploy more powerful scale-out infrastructure than ever before. It has also changed and improved the way data warehousing and business intelligence are approached and managed. The storage and performance capabilities of Big Data have been a game changer; traditional descriptive BI and reporting will never be the same. But this is just step one. The best is yet to come.

The industry is now going through a learning process in how to manage all this data at massive scale. Storing and managing more data is great, but people and businesses will get smarter about how much data to keep as it starts to hurt more (hurt the pocketbook, that is). How much data you keep and mine will depend on statistically driven best practices, not just on data warehousing or how big your HDFS cluster is. The mainstreaming of Big Data has provided the muscle to store and process massive amounts of data at near-linear scale, but we will not see the real value of all this storage and processing until machine learning and data science tools become more accessible (to the non-PhD data scientists among us) and mainstream, and businesses learn how to apply these tools and disciplines effectively.

Machine Learning will provide the brains to go along with the Big Data muscle. In the long run, businesses will decide how much data to keep around based on statistical measures and best practices, as they grow to understand their data and their business better while building out their predictive and prescriptive analytics.
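As a toy illustration of that shift from descriptive to predictive, here is a small Python sketch using entirely synthetic data and hypothetical features. Instead of only reporting what happened last quarter, it fits a model that scores which customers are likely to churn next quarter, the kind of question a dashboard alone cannot answer:

# Toy example: synthetic data, hypothetical feature meanings.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 4))  # e.g. usage, tenure, spend, support tickets
y = (X[:, 0] + 0.5 * X[:, 3] + rng.normal(scale=0.5, size=1000) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

print("holdout accuracy:", model.score(X_test, y_test))
churn_risk = model.predict_proba(X_test)[:, 1]  # a score you can act on, not just report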

Sunday, October 20, 2013

JobServer and Mesos Make a Great Pair

We are happy to announce the release of JobServer 3.6 beta1, an early access release with integrated support for Mesos clustering and distributed job processing. With this release, you can schedule and run jobs on a Mesos cluster of any size and configuration. Say goodbye to cron jobs!

JobServer has always supported distributed job scheduling and processing and has long been a great replacement for cron. Now, with Mesos integration, JobServer takes this to the next level by leveraging Mesos for dynamic resource management and reliability, while bringing its powerful scheduling, reporting and monitoring features to Mesos environments. Distributed job scheduling and batch processing just got more interesting!

With this release you can track and manage jobs as they run across a dynamic and highly resilient cluster of servers. JobServer with Mesos allows you to run scripts and jobs across your cluster and control how compute resources are allocated and utilized. If you are a Mesos user today, give JobServer a try and say goodbye to cron. If you are a JobServer user, get your compute resources under control with Mesos.

Download the beta release of JobServer v3.6 and tame your IT environment using all the advantages of Mesos and JobServer.

Grand Logic delivers software solutions that automate your business processes and tame your IT operations & Big Data analytics. Grand Logic delivers data and job automation software and Hadoop consulting services that maximize your Big Data investment.