Tuesday, December 18, 2012

Turning Scripts into GUI Web Applications

How would you like to turn any Linux bash script, command line program or Windows batch script into a GUI driven web application that any business user can invoke and manage, all in just a few clicks? JobServer provides some great features to make this happen. JobServer allows you to embed or associate any script or command line program with a Tasklet that can then be run in a server-side job. The job can be configured to be invoked and run by any business user from JobServer's web UIs. The script can be manually launched by the user (and customized on the fly) from the GUI, or scheduled to run at a later time or on a recurring basis.

For example, an IT administrator or developer can embed their Linux script into JobServer and then expose it as a web application that any business user can manually run, schedule, monitor and track. The administrator can easily customize and parameterize the job/tasklet to accept custom input parameters that the business user passes in directly from the web GUI.
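For instance, the script behind such a job might be nothing more than a small parameterized program. Here is a minimal Python sketch; the script, its parameters and the report path are all hypothetical, just to illustrate the kind of inputs a business user could supply from the GUI:

import argparse
import datetime

# Hypothetical example: a parameterized report script whose options a
# business user could fill in from JobServer's web GUI when launching the job.
parser = argparse.ArgumentParser(description="Generate a simple sales report")
parser.add_argument("--region", default="US", help="sales region to report on")
parser.add_argument("--output-dir", default=".", help="directory to write the report into")
args = parser.parse_args()

report_path = "%s/sales_%s_%s.txt" % (args.output_dir, args.region, datetime.date.today())
with open(report_path, "w") as out:
    out.write("Sales report for region %s\n" % args.region)
print("Report written to " + report_path)

The administrator decides which parameters to expose; the business user just fills in the values and clicks run.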

Give JobServer a try, and turn any Linux or Windows script or command line program into a GUI application that can be run and tracked by business users. What is also great about this is that the developer or IT administrator retains full control over the scripts being run and can throttle capacity, disable/enable the scripts at any time, and track who has been running the scripts/jobs. With just a few clicks and some cut and paste you can turn your scripts into GUI web applications!

Download and test drive JobServer now and learn more about JobServer's powerful developer SDK, soafaces, which makes it easier to extend and customize JobServer and to develop custom jobs and backend automated services.

Grand Logic delivers software solutions that automate your business processes and tame your IT operations & Big Data analytics. Grand Logic delivers data and job automation software and Hadoop consulting services that maximize your Big Data investment.

Friday, December 7, 2012

JobServer 3.4.14 for Oracle RAC

We are happy to announce the release of JobServer 3.4.14 which brings expanded enterprise features to JobServer's scheduling engine and job processing platform. JobServer has always been designed from the ground up for massive job scheduling and processing scalability while being highly resilient in the face of hardware, network and database interruptions. Reliable, repeatable, reportable, and measurable job scheduling, processing and management has always been the centerpiece of our focus with the JobServer platform.

With this release, JobServer now supports Oracle RAC 11g and allows for hot failover at the database layer. By enabling Oracle SCAN configuration, JobServer can leverage Oracle RAC's dynamic failover and database routing capability, allowing JobServer to keep accessing its database data and transactions during critical job scheduling and job processing functions.

"We are excited about our Oracle RAC support in JobServer as this brings another level of enterprise fault tolerance into the JobServer Platform". JobServer tames your job processing and scheduling environment in a way that is a joy for Java developers to develop and customize upon while providing powerful management and administration features for business users and IT operations administrators.

Download and test drive JobServer 3.4.14 now and learn more about JobServer's powerful developer SDK, soafaces, which makes it easier to extend and customize JobServer and to develop custom jobs and backend automated services.

Grand Logic delivers software solutions that automate your business processes and tame your IT operations & Big Data analytics. Grand Logic delivers data and job automation software and Hadoop consulting services that maximize your Big Data investment.

Wednesday, September 26, 2012

BigQuery: Data Warehouse in the Clouds

There are a lot of changes occurring these days with the Big Data revolution, with cloud computing, NoSQL, columnar stores and virtualization just a few of the fast moving technologies that are transforming how we manage our data and run our IT operations. Big Data, powered by technologies such as Hadoop and NoSQL, is changing how many enterprises manage their data warehousing and scale their analytics reporting. Storing terabytes of data, and even petabytes, is now in the reach of any enterprise that can afford to spend the money on potentially hundreds or thousands of commodity cores and disks to run parallel and distributed processing engines like MapReduce. But is Hadoop the right fit for everyone? Are there alternatives, especially for those that want more real-time big data analytics? Read on.

A Little Background on Hadoop

With Hadoop and many related types of large distributed clustered systems, managing hundreds if not thousands of CPUs, cores and disks is a serious system administration challenge for any enterprise, big or small. Cloud based Hadoop engines like Amazon EMR and Google Hadoop make this a little easier, but these cloud solutions are not ideal for typical long-running data analytics because of the time it takes to set up the virtual instances and spray the data out of S3 and into the virtual data nodes. And then you have to tear down everything after you are done with your MapReduce/HDFS instances to avoid paying big dollars for long running VMs. Not to mention you have to copy your data back out of HDFS and into S3 before your ephemeral data nodes are shut down - not ideal for any serious Big Data analytics.

Then there is the fact that Hadoop and MapReduce are batch oriented and thus not ideal for real-time analytics. So while we have taken many steps forward in technology evolution, the system administration challenges in managing large Hadoop clusters, for example, are still a problem, and cloud based Hadoop has many limitations and restrictions as already mentioned. In their current form, cloud based Hadoop solutions are too expensive for long running cluster processing and not ideal for long-term distributed data storage. Not to mention that virtualization and Hadoop are not a great fit just yet given the current state of virtualization and public cloud hardware and software technology - but that is a separate discussion.

The BigQuery Alternative

So if I want to build a serious enterprise scale Big Data Warehouse, it sounds like I have to build it myself and manage it on my own. Now enter into the picture Google BigQuery and Dremel. BigQuery is a serious game changer in a number of ways. First, it truly pushes big data into the clouds and, even more importantly, it pushes the system administration of the cluster (basically a multi-tenant Google super cluster) into the clouds and leaves this type of admin work to people (like Google) who are very good at this sort of thing. Second, it is truly multi-tenant from the ground up, so efficient utilization of system resources is greatly improved, something Hadoop is currently weak at.

Put your Data Warehouse in the Cloud

So now, given all this, what if you could build your data warehouse and analytics engine in the clouds with BigQuery? BigQuery gives you massive data storage to house your data sets and a powerful SQL-like query language, Dremel, for building your analytics and reports. Think of BigQuery as one of your datamarts, where you can store both fast and slow changing dimensions of your data warehouse in BigQuery's cloud storage tables. Then using Dremel you can build near real-time and complex analytical queries and run all of this against terabytes of data. And all of this is available to you without buying or managing any Big Data hardware clusters!

Modeling Your Data

In a classical Data Warehouse (DW), you organize your schema around a set of fact tables and dimension tables using some sort of snowflake schema or perhaps a simplified star schema. This is what is typically done for RDBMS based data warehouses. But for anyone who has worked with HDFS, HBase and other columnar or NoSQL data stores, this relational model of a DW no longer applies. Modeling a DW in a NoSQL or columnar data store requires a different approach. And this is what is needed when modeling your DW in BigQuery's data tables.

Slow Changing Dimensions

Slow Changing Dimensions (SCD) are straightforward to implement with a BigQuery data warehouse, since in an SCD model you are typically inserting new records into your DW each time. SCD models are common when you are creating periodic, fixed point-in-time snapshots from your operational data stores. For example, quarterly sales data is always inserted into the DW tables with some kind of time stamp or date dimension. With a BigQuery data store you would put each record into its BigQuery table with a date/time stamp. So your ETL would look something like this:


Nothing special here with this ETL diagram, other than the data is moving from your enterprise to the Google Cloud. The output of the ETL is directed to BigQuery for storage in one or more BigQuery tables (note this can be staged via Google Cloud Storage). But keep in mind that when creating a Big Data Warehouse, you are typically storing your data in a NoSQL, columnar or HDFS type data store, so you don't have a full RDBMS and all the related SQL join capability; you must therefore design your schemas to be much more denormalized than what is normally done in a DW. BigQuery is a hybrid type data store, however, so it does allow for joins and provides rich aggregate functions. How you model the time dimension is of particular importance - more on this later. So your schema for an SCD table might look something like this:

Key(s)... | Columns... | EffectiveDate

The time dimension in this case is collapsed directly into what would normally be your fact table, and you would want, as much as possible, to denormalize the tables so your queries require minimal joins. As noted, Dremel allows for joins but requires that at least one of the tables in the join be "small", where small means less than 8MB of compressed data.
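To make the load side of this concrete, here is a minimal Python sketch of an SCD-style append; the table name, the columns and the append_rows helper are hypothetical stand-ins for whatever loading mechanism you use (for example, staging a file via Google Cloud Storage and issuing a BigQuery load job):

import datetime

def append_rows(table, rows):
    # Hypothetical loader: in practice this would stage the rows to Google
    # Cloud Storage and kick off a BigQuery load job into the target table.
    print("Appending %d rows to %s" % (len(rows), table))

# Quarterly snapshot pulled from the operational store, already denormalized.
snapshot = [
    {"ProductKey": 101, "Region": "US", "Sales": 25000.0},
    {"ProductKey": 102, "Region": "EU", "Sales": 18000.0},
]

# SCD-style load: never update in place, always insert new dated records.
effective_date = datetime.date.today().isoformat()
append_rows("MyDataset.MyTable", [dict(r, EffectiveDate=effective_date) for r in snapshot])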

So now, to select a specific record for a particular point in time, you would simply write a normal looking SQL statement in Dremel such as this:

SELECT Column1 FROM MyTable WHERE EffectiveDate=DATE_OF_INTEREST

This query selects a record at a known date. With this approach you can, for example, query for quarterly sales data where you know the records must exist for that particular date. But what if you want the most "current" record as of any given point in time? This is actually something Dremel and BigQuery excel at, because they give you SQL functionality, such as subselects, that is not typically found in NoSQL type storage engines. The query would look like this:

SELECT Column1 FROM MyTable WHERE EffectiveDate = (SELECT MAX(EffectiveDate) FROM MyTable WHERE EffectiveDate <= DATE_OF_INTEREST)

This kind of query can be considered bad practice in a standard RDBMS (especially for very large tables) because of the performance cost of the subselect. With Dremel, however, this is not a problem given the way Dremel queries scale out and the fact that they do not rely on indexes.

Fast Changing Dimensions

Fast Changing Dimensions (FCD) require a bit more effort to create in a typical DW, and this is no different with BigQuery. In an FCD, you are often capturing frequent or near real-time changes from your operational data stores and moving the new data into your DW through your ETL. Your ETL engine must normally decide when to insert a new fact or time dimension record, and this often involves "terminating" the previously current record in the lineage of a record's history set. But by leveraging the power of Dremel, FCD can be supported in BigQuery by just inserting a new record when the on-premises ETL engine detects a change, without terminating existing current records. And because you can perform the effective date based subselect noted above, there is no longer any reason to maintain effective/termination date fields for each record. You only need the effective date.

This makes the FCD schema model stored in BigQuery identical to the SCD model for managing the time dimension; however, there is a catch. The ETL process must maintain a "Staging DW" of the records that exist on the BigQuery side. This Staging DW only holds the most current version of each record in your BigQuery table, so it stays lean and does not grow with record history over time.
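Here is a minimal Python sketch of that change detection step, with the Staging DW represented as a simple in-memory dictionary keyed by business key (all names and values are hypothetical):

import datetime

# Staging DW: holds only the most current record per business key,
# mirroring what is already stored on the BigQuery side.
staging = {
    101: {"Region": "US", "Sales": 25000.0},
    102: {"Region": "EU", "Sales": 18000.0},
}

def detect_changes(incoming):
    # Return only new or changed records, stamped with an effective date,
    # and refresh the staging copy so it always reflects the current state.
    changed = []
    today = datetime.date.today().isoformat()
    for key, record in incoming.items():
        if staging.get(key) != record:
            staging[key] = record
            changed.append(dict(record, Key=key, EffectiveDate=today))
    return changed

incoming = {
    101: {"Region": "US", "Sales": 26500.0},  # changed record
    102: {"Region": "EU", "Sales": 18000.0},  # unchanged record
}
# Only the changed rows get appended to BigQuery; nothing is terminated or updated in place.
print(detect_changes(incoming))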

So with this model your ETL will only send changes to the Google Cloud. This overall approach for FCD is useful for modeling ERP type data, for example, where records have effective and termination dates and where tracking changes is critical. Here is a diagram of the FCD ETL flow:



Note that for an FCD model that is not ERP centric (where the data model does not depend on effective/termination date semantics), the Staging DW is not required. This is typically the case when you are just blasting high volume, loosely structured data into BigQuery, such as log events or other timestamped action/event data. In this case, you don't have to detect changes and can just send the data to BigQuery for storage as it comes in.

Put your Data Warehouse in the Cloud

At Grand Logic we offer a powerful new way to build and augment your internal data warehouse with a BigQuery datamart in the Google cloud. Leveraging our real-time and batch capable ETL engines, we can move your fast or slow moving dimensional data into unlimited capacity BigQuery tables and let you run real-time SQL Dremel queries for rich reporting that will scale. And you can do all this with little upfront cost and infrastructure compared to managing your own HDFS and HBase cluster in Hadoop, for example.

With our flagship automation and ETL engine, JobServer, we can help you build a powerful data warehouse in the Google cloud with rich analytics, little upfront investment, and the ability to scale to massive levels. Pay as you go, with full control over your data and your reporting.

Stay tuned to this blog for more details on how Grand Logic can help you build your Data Warehouse in the clouds. We will be discussing more details of our JobServer product and how our consulting services can get you going with BigQuery.

Contact us to learn how our JobServer product can help you scale your ETL and Data Warehousing into the cloud.

Tuesday, September 18, 2012

The Big Data Evolution Will Continue - No Kidding

Big Data is very much about discovering information locked in the mountains of data that come out of your production center, IT operations, enterprise systems, and back office databases. Information is all in the eye of the beholder, so one person's junk is another person's gold. These days, with the volumes of social data and device data growing at astronomical rates, there is a lot of data to sift through and make sense of.

While it is true that the more data you capture, the more information you can possibly discover, there is a limit to this. I think we are going through a cycle where capturing and trying to make sense of vast volumes of data (social data, sensor data, etc.) is becoming more economical and somewhat mainstream with respect to technology and tools. However, I believe this is cyclical; at some point businesses will realize that maybe they are getting diminishing returns on all this data they are capturing and storing. For example, will I really care, 20 years from now, what I tweeted today? I probably will never have the time to go back and look at it, and I am not sure it is valuable to any marketing person (but who knows).

There is definitely gold to be mined in many data sets that now go untapped, and technologies like Hadoop, BigQuery and Storm, to name a few, are good tools to use, but not everything fits into the Big Data tent either.

There has been a lot of hype around Big Data these days, and I see a lot of people trying to shoehorn problems into Hadoop that have no reason to be there, other than it being the cool thing to do. You could do the data crunching in easier ways, for example. However, the tool sets are expanding to give developers, scientists and business people more options when deciding how to store and analyze their data.

When thinking of Big Data, first ask yourself the following questions:

1) How much data do I want to capture and store (do I need to persist detailed records/data)?
2) How fast is this data being created (velocity)?
3) How long do I want to keep it (forever)?
4) How long am I willing to wait to get "information" when I run my analysis (batch/hourly/daily or real-time)?
5) What will it cost me to keep all this data around, and do I have the system admin muscle to do this?

This might help you determine in which of the particular emerging Big Data technology buckets your problem best fits and which approach to take (cloud cluster, on-premises cluster, etc.).

Sunday, July 8, 2012

Big Data Automation in the Cloud

Grand Logic is happy to announce expanded support for cloud analytics and big data automation services through our flagship product, JobServer. With JobServer, enterprises of all sizes, from startups to Fortune 100 companies can leverage the power of the cloud to tap the full potential of cloud based Big Data computing and analytics processing.

With solutions such as Amazon EMR and Google BigQuery growing in adoption and becoming economically advantageous, businesses now more than ever need to automate the flow of data between their enterprise storage systems and the cloud. Moving data and information between corporate intranets and the cloud is vital for efficient cloud based Big Data processing.

JobServer's point and click automation and scheduling tools are ideal for centrally managing the flow of data between your enterprise and Big Data cloud systems such as Amazon EMR and Google BigQuery. JobServer can orchestrate the loading and retrieval of data for your Big Data processing systems in the cloud while tracking all your Big Data processing jobs, giving you one place to see everything that is happening in your Hadoop or BigQuery analytics processing.

In a typical deployment, JobServer sits on your corporate intranet and can load and move data from your in-house storage systems into the cloud for efficient processing, then track all Big Data job processing activity to return the necessary critical data and results back in-house or move them around in the cloud (for example, moving data into and out of S3, etc.). Alternatively, JobServer can also be easily deployed on Amazon EC2 or Google Compute Engine instances and run in the cloud. There are multiple topologies possible based on your business operations.

JobServer comes with a built-in and open source plugin API that makes it easy to script against the Amazon or Google web service APIs and create custom tasks and jobs using Java, web services, GWT and python/ruby/bash scripts. For example, you can create complex MapReduce jobs in JobServer, get notified when processing is completed, and be alerted of any issues at every stage of processing. JobServer also lets you schedule and track detailed realtime and historical reports on all job processing activities, whether you are running a Hadoop job, loading a table into the cloud, pulling data back out of BigQuery temp tables, or tracking the progress of BigQuery batch processing jobs.
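As a rough illustration of the kind of glue logic a JobServer job might drive, here is a short Python sketch; the helper functions are hypothetical placeholders, not part of any real JobServer or cloud API:

def export_from_warehouse(query, local_file):
    # Hypothetical: dump operational data to a local file.
    print("Exporting '%s' to %s" % (query, local_file))

def upload_to_cloud(local_file, bucket_path):
    # Hypothetical: copy the file to cloud storage (S3 or Google Cloud Storage).
    print("Uploading %s to %s" % (local_file, bucket_path))

def launch_big_data_job(job_name):
    # Hypothetical: kick off an EMR or BigQuery batch job and return a job id.
    print("Launching %s" % job_name)
    return "job-123"

def wait_and_report(job_id):
    # Hypothetical: poll the job until done; JobServer handles alerting and history.
    print("Job %s finished" % job_id)

# A single scheduled JobServer job can chain these steps, and the platform
# tracks and reports on every run.
export_from_warehouse("SELECT * FROM orders", "/tmp/orders.csv")
upload_to_cloud("/tmp/orders.csv", "mybucket/orders.csv")
wait_and_report(launch_big_data_job("daily-orders-analysis"))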

JobServer gives you central control over any automation task you want to perform in the cloud, or between activities happening in the cloud and your local enterprise storage and database systems. Try JobServer today and you will wonder how you ever operated without it.

About Grand Logic
Grand Logic delivers software solutions that automate business processes and tame your Big Data operations. Grand Logic delivers automation software and Hadoop consulting services that maximize your Big Data investment.

Friday, February 17, 2012

JobServer Support on Mac OS X

Grand Logic is happy to announce the release of JobServer 3.4.4. For all those Apple fans, this release provides support for JobServer on Mac OS X. You can now install and deploy JobServer on your favorite Mac. This release includes minor bug fixes.

Download and test drive JobServer 3.4.4 now and learn more about JobServer's powerful developer SDK, soafaces, which makes it easier to extend and customize JobServer and to develop custom jobs and backend automated services, while using some of the best Java/AJAX and web/SOA open source technology available to developers.

About Grand Logic
Grand Logic is dedicated to delivering software solutions to its customers that help them automate their business and manage their processes. Grand Logic delivers automation software and specializes in mobile and web products and solutions that streamline business.

Tuesday, February 14, 2012

Enterprise Job Scheduling for Big Data & Hadoop

Businesses of all sizes are looking beyond traditional business intelligence, taking a broader approach to BI that goes beyond the traditional data warehouse and operational database technologies of the past. With the explosion of social communication, mobile device data and many other forms of unstructured data coming into focus, businesses are now more interested than ever in asking questions about their data and their customers that they could not ask before.

Hadoop type solutions let businesses build out this new BI 2.0 type architecture and begin to leverage their data and operations in new ways, asking questions they could not have imagined possible in the past. Hadoop analytics lets businesses ask questions and build reporting solutions that effectively leverage massive (yet commodity) processing power and manipulate terabytes of data in ways that were not practical for the average enterprise before.

Hadoop provides a broad stack of solutions, from CPU/compute clustering, parallel programming and distributed data management to advanced ETL and NoSQL type data management. Hadoop is also moving quickly to build more advanced resource management to allow more efficient job flow processing on larger clusters, for the bigger deployments that may have hundreds or thousands of nodes and need to run many jobs concurrently.

Hadoop comes with a few internal capacity type schedulers for managing internal cluster load and resource management, but these are strictly for capacity scheduling between nodes and are not functional or calendar based job scheduling tools. Vanilla Hadoop distributions do not include the often necessary features required by enterprises to manage and automate the full ecosystem and life-cycle of data processing typically needed to effectively support an end to end BI solution. In most cases an enterprise's IT group must build the necessary infrastructure to smoothly integrate Hadoop into their IT environment and avoid a lot of manual labor and impedance mismatches between their Hadoop operations and their traditional enterprise operations.

This is where JobServer, an enterprise job scheduler, comes into play. JobServer integrates with Hadoop at an enterprise IT level, letting analysts and IT administrators schedule and integrate their IT operations into the Hadoop stack. JobServer leverages a very open and flexible Java plugin API to let Java developers integrate their customizations tightly into JobServer and into Hadoop. Often what is needed is high level job and workflow automation in order to schedule ETL processing from operational data stores, pump data into your Hadoop stack, and schedule jobs to run on regular intervals based on business rules and business needs.

JobServer provides the job automation and job scheduling needed to accomplish this, plus it offers key features such as audit trails to track which jobs were run, when, and edited by whom, for example. JobServer can be used to coordinate and orchestrate a number of Hadoop job flows into a larger job flow and then take the output and pump it back out into your enterprise reporting systems and enterprise data warehouses. JobServer also provides a number of GUI reporting features to let enterprise users, from programmers to IT staff, track what is going on in your Hadoop and IT environment and be alerted quickly of problems.
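For example, a scheduled JobServer job might do little more than launch a Hadoop job flow and hand the results to a downstream export step. Here is a bare-bones Python sketch using the standard hadoop jar launcher (the jar, main class and HDFS paths are hypothetical):

import subprocess

# Run a MapReduce job flow via the standard "hadoop jar" command.
# The jar name, main class and HDFS paths below are hypothetical.
result = subprocess.call([
    "hadoop", "jar", "analytics.jar", "com.example.SalesRollup",
    "/data/raw/sales", "/data/out/sales-rollup",
])

if result != 0:
    # A non-zero exit code lets JobServer flag the run as failed and alert the right people.
    raise SystemExit("Hadoop job failed with exit code %d" % result)

# On success, a follow-on task can export /data/out/sales-rollup back into
# the enterprise data warehouse or reporting systems.
print("Hadoop job flow completed")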

If you need to tame your Hadoop operations and provide automated and tight integration with your existing IT environment, applications and reporting solutions, give JobServer a look. It can be a great asset to help you run your Big Data operations more efficiently. Visit the JobServer product website for more details.

Contact Grand Logic and see how we can help you make better sense of your Big Data environment. JobServer is also partnering with other Big Data solution providers and major distributions to provide complete Big Data solutions for both your in house and cloud Hadoop deployments. Please contact Grand Logic for more information and to see how our products and services can make your Hadoop deployment a success.

Tuesday, February 7, 2012

Native Multi-Tenant Hadoop - Big Data 2.0

For Hadoop to gain wider adoption and lower its barrier of entry to a broader audience, it must become much more economical for businesses of all sizes to manage and operate a Hadoop processing cluster. Right now it takes a significant upfront investment in hardware, plus considerable IT admin know-how, to provision, configure and manage a full blown Hadoop cluster for any significant operation.

Cloud services like Amazon Elastic Map Reduce help reduce some of this, but they can quickly become costly if you need to do seriously heavy processing, especially if you need to manage data in HDFS, as opposed to constantly moving it between your HDFS cluster and S3 in order to shut down datanodes and save cost, as is the standard with Amazon EMR. Utilities like Whirr also help push the infrastructure management onto the EC2 cloud, but here again, for serious data processing, this can quickly become cost prohibitive.

Operating short lived Hadoop clusters can be a useful option, but many organizations need long running processing and need to leverage HDFS for longer-term persistence, as opposed to just a transient storage engine during the lifespan of MapReduce processing, as is the case with Amazon EMR. For Hadoop, and Big Data in general, to make the next evolutionary leap for the broader business world, we need a fully secure and multi-tenant Hadoop platform. In such a multi-tenant environment, organizations can share clusters securely and manage the processing load in very controllable ways, while allowing each tenant to customize their Hadoop job flows and code in an isolated manner.

Hadoop already has various capacity management scheduling algorithms, but what is needed is higher order resource management that can fully isolate different organizations, for both HDFS security and data processing purposes, to support true multi-tenant capability. This will drive wider adoption within large organizations and by infrastructure service providers, because it will increase the efficient utilization of unused CPU and storage, in much the same way that SaaS has allowed software to achieve greater economies of scale and democratize software for small and big organizations alike.

Native multi-tenant support in Hadoop will drastically reduce the upfront cost of rolling out a Hadoop environment, make the long-term costs much more manageable, and open the door for Hadoop and Big Data solutions to go mainstream in much the same way that Salesforce, for example, has created a rich ecosystem of solutions around business applications and CRM. This will also allow organizations to keep long-running environments and keep their data in HDFS for longer periods of time, allowing them to be more creative and spontaneous.

Thursday, January 12, 2012

End to End Big Data Solution

Grand Logic announces an end to end Big Data solution. Our flagship product, JobServer, and its supporting open source SDKs provide a superior platform for taking your raw data and creating business solutions that will drive ROI and deliver on the promise of Hadoop.

Hadoop is a great solution, but alone it is an island of data processing, algorithms and open source tools. JobServer integrates Hadoop into your enterprise to automate the flow of data and manage ETL processing to efficiently organize and track your Hadoop processing. Then it delivers rich visualization for your Hadoop results to allow you to maximize your business objectives with Big Data. Whether you are targeting mobile, tablets or desktop/web devices, JobServer's powerful GWT based SDK can deliver a rich user experience and visualization for your reports and applications.

All this allows you to manage, monitor and track your Hadoop processing to deliver the control and central management you need to empower your developers and business analysts. JobServer with Hadoop allows you to acquire your data, process it and then visualize it. See this architecture diagram of our end to end JobServer/Hadoop solution stack.



Contact Grand Logic and see how we can help you make better sense of your Big Data environment. JobServer is also partnering with other Big Data solution providers and major distributions to provide complete Big Data solutions for both your in house and cloud Hadoop deployments. Please contact Grand Logic for more information and to see how our products and services can make your Hadoop deployment a success.