Foundations for a Data Warehouse on the Cloud

Many of us who have spent time in the industry are quite familiar with data storage for transactional system (e.g. making a purchase) which is a system of records and have heard of Data Warehouses as part of the corporate information factory in the quest of “single version of truth” management reports.   But a trip down the memory lane is necessary to set the context right for those who are just coming of age so that all of us can be on the same page before we talk of modern data systems.

What is a Data warehouse and why is it needed?

From simple mechanisms for holding data like punch cards and paper tapes fifty years ago to real-time data processing systems like Hadoop, data storage systems have come a long way to become what they are now in the “serverless” cloud era. For over 30 years, data warehouses have been a rich business-insights source. 

A data warehouse consolidates data from disparate sources and performs analytics on the aggregated data to add value into the business operations by providing insights. Data warehouses are the custodians of the most important business data in the enterprise for the last three decades. As enterprises are increasingly becoming data driven, data warehouses play an increasingly critical role in their digital transformation journey.

With all the transformations in the sphere of cloud and information technologies, it may seem as if data warehousing has lost its relevance. Though there are countless options for storing, analyzing, and indexing data, data warehouses have remained relevant.

Data warehouses still form the foundation of enterprises’ analytics strategy and in fact the warehouse use-cases have gone beyond traditional operational reporting. Today, enterprises need to:

  1. Have a 360⁰ view of their businesses: As cost of storage and data processing reduce, enterprises want to process, store, and analyze all relevant datasets, both internal and external to their organization.

  2. Be situationally aware of and responsive to real-time business events: Enterprises need to gain insights from real time events and not wait for days or weeks to analyze data. The data warehouse needs to reflect the present state of business at all times

  3. Reduce time to insights: Enterprises need to get up and running fast without waiting days or months. for hardware or software to be installed or configured.

  4. Make insights available to business users to enable data driven decision making across the enterprise: In order to embrace a data-driven culture, enterprises need to democratize access to data.

  5. Secure their data and govern its use: Data needs to be secure and accessible to the right stakeholders inside and outside of the enterprise.

As enterprises look to expand their usage of the traditional data warehouses with growing data volumes, they face tremendous challenges as their cost continues to spiral out of control due to higher TCO. Traditional data warehouses were not designed to handle the explosive growth in data and certainly not built for emerging data processing patterns.

Modern data warehouses on the cloud 

For decades, enterprises have relied on traditional on-premises data warehouses to collect and store their most valuable data. But these traditional data warehouses can be costly, inflexible, and difficult to maintain, and for many, they no longer meet today’s business needs. Enterprises need an easy, scalable way to store all that data, as well as to take advantage of advanced analytic tools that can help them find valuable insights. As a result, many are turning to cloud data warehousing solutions like Redshift, Snowflake or BigQuery to name just a few top popular choices.

AWS Redshift

Amazon Web Services Redshift is a cloud-based data warehouse software for enterprises. The platform enables fast processing of massive data sets. Not only is it fit for quality data analytics, it also provides automatic concurrency querying as per workload demand. Redshift is a more self-managed solution meaning that engineers will have to spend time on resource and server management. 

Redshift is designed with the shared-nothing MPP architecture. It comprises data warehouse clusters with compute nodes split up into node slices. Individual compute nodes are assigned with the code by the leader node. The system communicates with client applications by using industry-standard JDBC and ODBC drivers. The technology can be integrated with most existing SQL-based client applications, ETL, BI, data analytics, and data mining tools.

GCP Big Query 

Developed by Google as part of its Google Cloud Platform offering, Big Query  provides opportunities for querying large data sets. This is a cost-effective multi-cloud data warehouse technology possessing machine learning capabilities. 

BigQuery possesses a serverless architecture where storage and compute are separated. The main component of BigQuery architecture is called Dremel. This is a massively parallel query engine with the functionality to read thousands of rows in seconds. Data is stored in replicated, distributed storage, and processed in compute clusters consisting of nodes. This structure provides vast flexibility and differs from traditional on-premise or node-based cloud data warehouse technologies. With such an approach under the hood, various users can put their data into the data warehouse and start analyzing that data using Standard SQL.

Snowflake

For companies  in search of easy deployment and configuration, Snowflake is an all-inclusive cloud data warehouse for structured and semi-structured data provided as Software-as-a-Service. With it, the customer doesn’t  need to select, install, or manage any virtual or physical hardware, except for configuring the size and number of compute clusters. The rest of the maintenance duties are carried by Snowflake, which makes this solution practically serverless. Unlike traditional legacy warehousing offerings, Snowflake provides more flexible, faster, and easier-to-use data storage and analytic solutions.Snowflake’s architecture is natively designed for the cloud and combined with an innovative SQL query engine. Combining the characteristics of traditional shared-disk and shared-nothing database architectures, Snowflake comprises three core layers such as database storage, query processing, and cloud services. There is a centralized data repository for a single copy of data that can be accessed from all independent compute nodes.

Reasons to choose modern cloud data warehouse products

It should be  no new revelation to anyone  that the business environment has become highly competitive over these years. To stay updated to the minute, organizations increasingly turn from traditional on-premises platforms to more advanced cloud-based data warehouses. Here’s why:

  • There’s no need to buy expensive physical hardware and hire an in-house team of specialists to maintain it.

  • Cloud warehouses are much easier to set up and run.

  • The absence of capital expenditure and low operational expenses are attractive features.

  • Great scalability opportunities come at more affordable prices.

  • The ability of modern warehouse architectures to perform complex analytical queries at a much faster pace due to their use of massively parallel processing (MPP) is an excellent option.

Evaluation Criteria for Cloud Data warehousing Products

The following picture summarizes the criteria that we look for while evaluating cloud data warehouse products.

Criteria to evaluate cloud data warehousing products

Key criteria in evaluating Cloud data warehousing options

Any of the previously mentioned data warehouse vendors are worthy candidates if you need to migrate your data into the cloud. The evaluation criteria checklist we provided here and the comparison of available products should help you assess whether or not a certain end-to-end solution fills the bill. Best approach  is to conduct a proof of concept or a product  try out/bakeoff. And yet, some companies may find it difficult to decide which solution suits their workflows best or to figure out how to implement cloud data warehouse by themselves. If this is the case, it is advisable  to opt for the professional help of third-party providers with hands-on experience in data warehousing implementation and consulting. There is no shortage of advisors but trusted cloud advisors with solid data and analytics implementation background is what you should look for.  In our next article we will share a client case scenario to illustrate how we at Veridic approach such projects.