Unlike many misleading tech pseudonyms (looking at you ‘growth hacking’), data warehousing is a really good name:
Warehouse: a building that collects materials, stores and packages them in a sensible order and sends the materials off to other parts of the business when required
Data warehouse:a computer system that collects, stores, processes and outputs data
Although data warehousing sounds technical (and possibly complicated), if you have ever built a spreadsheet that records data in a sensible order, you have made a simple data warehouse.
However, modern day business information is vast, complex and heterogeneous. More sophisticated systems are needed to bring order to the complexity and allow for data to be analysed.
Stages of Data Warehouses
Data warehouses vary in how they are built, largely depending on the requirements of the ultimate output, however every data warehouse should have some key elements. The key parts of a data warehouse are:
Data Sources: This could be from internal data sources, such as daily production information or sales information, or external data sources, such as weather reports or FX rates.
Staging Area: Data is downloaded into a raw format, cleansed, divided and stored temporarily to be ready for download into storage.
Storage: Cleansed data is stored. Often data is stored in relational databases, where information is stored in specific tables which are then linked via a common field. For example, a retail business might keep store income and store size in two different tables, which can be linked together using store ID. Data is then stored in easy to access formats, such as SQL databases.
Data Marts: Data is organised by topic, e.g. sales, marketing, production information, for ease of retrieval and to help users find the information they need.
Output: Reports, dashboards, visualisations. Any sort of output that requires data. If possible, outputs can be directly linked to the Data Marts or to Storage for automated reporting and easy updating.
There are two types of hardware used to build a data warehouse:
On-prem: Data is accessed and stored via software and hardware located on the premises of the business. E.g. you have software on your laptop to access data, data is stored in a harddrive located in the office and requests are processed by a big server located in the basement.
Cloud: Data and requests for data are transferred over the internet. The hardware which stores and processes the data can be located anywhere in the world. E.g. you use your web browser to send a request (via the internet) to access the data, the request is processed by a server in a data centre in London and the data is retrieved and sent from data stored in another data centre in Manchester.
Systems are typically set up to be either on-prem or cloud, with lots of businesses taking on IT infrastructure projects to move their systems away from on-prem to cloud. However, there is a growing trend of hybrid solutions where a mixture of on-prem and cloud is used, for reasons including cost, security and speed.
Benefits of Data Warehousing
Quality: Everyone knows what ‘messy’ data looks like. Storing and processing data in an orderly manner can increase the quality of data immensely. E.g. poor quality data might store the same customer as three separate entries due to inconsistent formatting (e.g. 1) A Nother, 2) A. Nother, 3) Nother, A). A proper data warehouse process might automatically cleanse these names and store them as one customer.
Consistency: A problem I often encounter is businesses struggling to collate data across different sources, e.g. sales data and revenue data. A typical problem is being unable to link the bookings in the CRM system to revenue, so it is very difficult to understand how sales translate to revenue and cash. A data warehouse might identify sales bookings with a unique identifier which is then consistently used across invoicing and revenue systems.
Timeliness of analysis: Many finance teams I have encountered spend 50%+ of their time preparing formulaic monthly reports, and often month-end reports are finalised halfway through the following month. A data warehouse can distill that process from weeks to hours by creating a process that automatically prepares the reports in the required format. The key benefit of this is that decisions can be made much more quickly and while the data is at its most relevant.
Volume: Regular Microsoft Excel has a maximum capacity of 1,048,576 rows. This might seem like a lot, but this is increasingly (in my opinion) becoming a limiting factor in data storage. One example is a stock broker business that required analysis on their transactional data. The data set that needed analysing was over 5 billion rows! To store and analyse that amount of data requires a proper data warehousing process.
Cost: Creating a data warehousing system requires more infrastructure than a disorganised process. This infrastructure, e.g. cloud storage, has costs which can become substantial.
Complexity: Building a high quality data warehouse is a high skilled job and some businesses do not have employees with the right skillset to build and maintain data warehouses.
Technical debt: Poorly built data warehouses can lead to legacy issues, such as the volume of data outgrowing the data warehouse’s infrastructure. The effort to patch, upgrade or rebuild data warehouses can be problematic if they have not been built for the long term.
High Latency (slow speeds): More recently, I have seen data pipelines struggle with the volume of data being processed. For static monthly reporting (e.g. downloading a board pack), this isn’t a major issue. However, if the data warehouse is feeding a dynamic tool to make decisions in real-time, a 10 second delay to load up a new cut of data is a problem.
Data warehousing is fundamental to reporting and data analytics. Although the concepts of cleansing, organising and storing data will be fairly obvious to a lot of people, the best practices of data warehousing and how all the processes fit together are less well understood.
Data warehouses are computer systems that collect, store, process and output data. Data warehouses are often split into five stages: 1) Data sources, 2) Staging area, 3) Storage, 4) Data marts and 5) Output, although there are lots different methodologies and terminologies.
The benefits of good data warehousing are significant and, from personal experience, can transform how middle & back office teams function by reducing reporting processes from weeks to minutes. The more exciting and valuable benefit of data warehousing is that it enables in-depth and real-time analysis of businesses that facilitates faster, more accurate and more insightful decision making.