Blog Archives

24/2/2020 5 Comments

data warehousing

Overview

Unlike many misleading tech buzzwords (looking at you, ‘growth hacking’), data warehousing is a really good name:

Warehouse: a building that collects materials, stores and packages them in a sensible order and sends the materials off to other parts of the business when required
Data warehouse: a computer system that collects, stores, processes and outputs data

Although data warehousing sounds technical (and possibly complicated), if you have ever built a spreadsheet that records data in a sensible order, you have made a simple data warehouse.

However, modern day business information is vast, complex and heterogeneous. More sophisticated systems are needed to bring order to the complexity and allow for data to be analysed.

Stages of Data Warehouses

Data warehouses vary in how they are built, largely depending on the requirements of the ultimate output. However, every data warehouse should have some key elements. The key parts of a data warehouse are:

Data Sources: This could be from internal data sources, such as daily production information or sales information, or external data sources, such as weather reports or FX rates.
Staging Area: Data is downloaded in a raw format, cleansed, divided and stored temporarily, ready for loading into storage.
Storage: Cleansed data is stored. Often data is stored in relational databases, where information is held in specific tables which are then linked via a common field. For example, a retail business might keep store income and store size in two different tables, which can be linked together using store ID. Data is kept in easy-to-access formats, such as SQL databases.
Data Marts: Data is organised by topic, e.g. sales, marketing, production information, for ease of retrieval and to help users find the information they need.
Output: Reports, dashboards, visualisations. Any sort of output that requires data. If possible, outputs can be directly linked to the Data Marts or to Storage for automated reporting and easy updating.
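To make the Storage stage more concrete, here is a minimal sketch of the retail example using Python's built-in sqlite3 module. The table names, column names and figures are all made up for illustration; a real warehouse would sit on a much larger database, but the idea of linking tables via a common field is the same.

```python
import sqlite3

# In-memory database standing in for the Storage stage (hypothetical data).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE store_income (store_id INTEGER PRIMARY KEY, income REAL);
    CREATE TABLE store_size   (store_id INTEGER PRIMARY KEY, size_sqm REAL);
    INSERT INTO store_income VALUES (1, 120000.0), (2, 80000.0);
    INSERT INTO store_size   VALUES (1, 400.0), (2, 200.0);
""")

# Link the two tables via the common field (store ID) and
# derive income per square metre for each store.
rows = conn.execute("""
    SELECT i.store_id, i.income / s.size_sqm AS income_per_sqm
    FROM store_income AS i
    JOIN store_size AS s ON s.store_id = i.store_id
    ORDER BY i.store_id
""").fetchall()

for store_id, income_per_sqm in rows:
    print(store_id, income_per_sqm)

conn.close()
```

Because both tables share store ID, a metric such as income per square metre can be calculated without ever holding the two data sets in the same table.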

Infrastructure Types

There are two main types of infrastructure used to build a data warehouse:

On-prem: Data is accessed and stored via software and hardware located on the premises of the business. E.g. you have software on your laptop to access data, data is stored on a hard drive located in the office and requests are processed by a big server located in the basement.
Cloud: Data and requests for data are transferred over the internet. The hardware which stores and processes the data can be located anywhere in the world. E.g. you use your web browser to send a request (via the internet) to access the data, the request is processed by a server in a data centre in London and the data is retrieved and sent from data stored in another data centre in Manchester.

Systems are typically set up to be either on-prem or cloud, with lots of businesses taking on IT infrastructure projects to move their systems away from on-prem to cloud. However, there is a growing trend of hybrid solutions where a mixture of on-prem and cloud is used, for reasons including cost, security and speed.

Benefits of Data Warehousing

Quality: Everyone knows what ‘messy’ data looks like. Storing and processing data in an orderly manner can increase the quality of data immensely. E.g. poor quality data might store the same customer as three separate entries due to inconsistent formatting (e.g. 1) A Nother, 2) A. Nother, 3) Nother, A). A proper data warehouse process might automatically cleanse these names and store them as one customer.
Consistency: A problem I often encounter is businesses struggling to collate data across different sources, e.g. sales data and revenue data. A typical problem is being unable to link the bookings in the CRM system to revenue, so it is very difficult to understand how sales translate to revenue and cash. A data warehouse might identify sales bookings with a unique identifier which is then consistently used across invoicing and revenue systems.
Timeliness of analysis: Many finance teams I have encountered spend 50%+ of their time preparing formulaic monthly reports, and often month-end reports are finalised halfway through the following month. A data warehouse can cut that process from weeks to hours by creating a process that automatically prepares the reports in the required format. The key benefit of this is that decisions can be made much more quickly and while the data is at its most relevant.
Volume: Regular Microsoft Excel has a maximum capacity of 1,048,576 rows. This might seem like a lot, but this is increasingly (in my opinion) becoming a limiting factor in data storage. One example is a stock broker business that required analysis on their transactional data. The data set that needed analysing was over 5 billion rows! To store and analyse that amount of data requires a proper data warehousing process.
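As a rough illustration of the Quality point above, the sketch below shows one way the three "A Nother" entries could be cleansed into a single customer. The function name `customer_key` is hypothetical, and the rules assume names consist of one initial plus a surname; real customer data would need far more careful matching than this.

```python
import re

def customer_key(raw_name: str) -> str:
    """Return a normalised 'initial surname' key for a raw customer name.

    Assumes a single initial plus a surname, optionally written
    as 'Surname, Initial'. This is an illustration, not a full matcher.
    """
    name = raw_name.strip()
    if "," in name:
        # 'Nother, A' -> 'A Nother'
        surname, initial = [part.strip() for part in name.split(",", 1)]
        name = f"{initial} {surname}"
    # Replace full stops with spaces, then collapse repeated whitespace.
    name = re.sub(r"\.", " ", name)
    name = re.sub(r"\s+", " ", name).strip()
    return name.lower()

raw_entries = ["A Nother", "A. Nother", "Nother, A"]
unique_customers = {customer_key(entry) for entry in raw_entries}
print(unique_customers)
```

All three spellings collapse to the same key, so they would be stored as one customer rather than three.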

Challenges

Cost: Creating a data warehousing system requires more infrastructure than a disorganised process. This infrastructure, e.g. cloud storage, has costs which can become substantial.
Complexity: Building a high quality data warehouse is a highly skilled job and some businesses do not have employees with the right skillset to build and maintain data warehouses.
Technical debt: Poorly built data warehouses can lead to legacy issues, such as the volume of data outgrowing the data warehouse’s infrastructure. The effort to patch, upgrade or rebuild data warehouses can be problematic if they have not been built for the long term.
High Latency (slow speeds): More recently, I have seen data pipelines struggle with the volume of data being processed. For static monthly reporting (e.g. downloading a board pack), this isn’t a major issue. However, if the data warehouse is feeding a dynamic tool to make decisions in real-time, a 10 second delay to load up a new cut of data is a problem.

Summary

Data warehousing is fundamental to reporting and data analytics. Although the concepts of cleansing, organising and storing data will be fairly obvious to a lot of people, the best practices of data warehousing and how all the processes fit together are less well understood.

Data warehouses are computer systems that collect, store, process and output data. Data warehouses are often split into five stages: 1) Data sources, 2) Staging area, 3) Storage, 4) Data marts and 5) Output, although there are lots of different methodologies and terminologies.

The benefits of good data warehousing are significant and, from personal experience, can transform how middle & back office teams function by reducing reporting processes from weeks to hours. The more exciting and valuable benefit of data warehousing is that it enables in-depth and real-time analysis of businesses that facilitates faster, more accurate and more insightful decision making.


12/2/2020 2 Comments

The Sankey Diagram

Introduction

Sankey Diagrams are used to visualise flows, processes and aggregated numbers. They are especially useful when breaking down a number into its component parts. The Sankey Diagram was first used by Matthew Sankey in 1898 to illustrate the flow of energy in a steam engine system. He wanted to visualise the energy efficiency of a steam combustion system by taking the input energy and sketching out where the energy went.

In keeping with the first law of thermodynamics (the conservation of energy), this chart shows the energy usage in its entirety: all of the outputs of the system add up to the input(s) to the system.

Sankey Diagrams are primarily used in physics, particularly to visualise energy systems. However, with a bit of creative thinking, Sankey Diagrams can also be used to powerfully illustrate concepts in finance and operations.

A Sankey Diagram might be helpful if you are trying to show the composition or aggregation of a number or if you are trying to link two sets of numbers which add up to the same total but are segmented differently (e.g. breakdown of costs by country and breakdown of costs by type).
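The "same total, segmented differently" idea can be sketched in a few lines of Python. The cost figures, countries and cost types below are invented purely for illustration. Each line item becomes a link from a country on the left of the diagram to a cost type on the right, and both sides must add up to the same total.

```python
from collections import defaultdict

# Hypothetical cost line items: (country, cost type, amount).
line_items = [
    ("UK", "Staff", 50),
    ("UK", "Rent", 20),
    ("France", "Staff", 30),
    ("France", "Marketing", 10),
]

# Each Sankey link runs from a country (left) to a cost type (right).
links = defaultdict(int)
by_country = defaultdict(int)
by_type = defaultdict(int)
for country, cost_type, amount in line_items:
    links[(country, cost_type)] += amount
    by_country[country] += amount
    by_type[cost_type] += amount

# Both segmentations must add up to the same total for the diagram to balance.
total_by_country = sum(by_country.values())
total_by_type = sum(by_type.values())
print(total_by_country, total_by_type)
```

If the two totals differ, some cost has been left out of one of the breakdowns and the diagram will not balance.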

Example 1: P&L Diagram

Sankey Diagrams can illustrate the breakdown of a P&L, and how revenue is ‘used’ by costs in the business to arrive at net profit. Although a slightly abstract concept, this can really help to understand the orders of magnitude of costs at different levels of a business. The chart below separates COGS, overheads and depreciation / finance costs to understand costs at the gross profit, operating profit, EBITDA and net profit levels.

An alternative diagram could show cash inflows and outflows in a business, with cash surplus on the right hand side of the diagram. If there is a cash flow deficit, this could be included on the side of cash inflows, to represent cash usage from cash reserves rather than cash inflows during the period.
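As a rough sketch of the data behind a P&L diagram like the one in this example, the flows can be written as (source, target, value) links and checked for balance at each profit level. The figures below are hypothetical, tax is ignored, and depreciation and finance costs are lumped into one flow to keep the example short.

```python
from collections import defaultdict

# Hypothetical P&L figures, in thousands.
flows = [
    ("Revenue", "COGS", 40),
    ("Revenue", "Gross profit", 60),
    ("Gross profit", "Overheads", 35),
    ("Gross profit", "EBITDA", 25),
    ("EBITDA", "Depreciation / finance costs", 10),
    ("EBITDA", "Net profit", 15),
]

inflow = defaultdict(int)
outflow = defaultdict(int)
for source, target, value in flows:
    outflow[source] += value
    inflow[target] += value

# Nodes with both an inflow and an outflow are the intermediate profit levels;
# what flows into each one must equal what flows out of it.
intermediate = [node for node in inflow if node in outflow]
unbalanced = [node for node in intermediate if inflow[node] != outflow[node]]
print(intermediate, unbalanced)
```

An empty `unbalanced` list means every profit level is fully accounted for, which is the same conservation idea that the original steam engine diagram relied on.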

Example 2: Interview Process

Processes can be tricky to visualise in an intuitive way. Sankey Diagrams can show what’s gone where during a process, which is especially useful in multi-stage processes where the inputs can go down a variety of output paths.

One example of this is an interview process, where you have a homogeneous input (applicants) and a variety of output paths that the applicant can go down, from falling at the first hurdle to finally accepting an offer. With many different possible outcomes, it can be difficult to visualise the process as a whole and understand what has happened to the applicants. Sankey Diagrams are a powerful way of showing a process in its entirety and giving a sense of scale.

Summary

Sankey Diagrams are a different breed of visualisation to the standard line, bar and pie charts, and they can be used to create very intuitive diagrams. The best use cases for Sankey Diagrams are numbers that require breaking down into their component parts or visualising processes with a numerical element. The key concept to remember is that the inputs and outputs of a Sankey Diagram must be equal, in line with the law of conservation of energy. Although these diagrams are normally used in physics and energy flow diagrams, with a little bit of creative thinking they can be applied to a much broader range of topics.

Coder’s Corner

Creating Sankey Diagrams is a bit tricky, mainly because they are not standard visualisations that are readily available in common software. I think that the easiest way to create simple Sankey Diagrams is in Microsoft Power BI (the P&L diagram was created using this software). However, this software has limited customisation options for Sankey Diagrams.

If you are desperate, you can create illustrative Sankey Diagrams just using the shapes in PowerPoint or Google Slides. However, the scales and sizes will probably be incorrect and creating the chart might be time consuming.

If you are more confident in Python, matplotlib includes a Sankey module for diagrams that can be customised and joined together. The interview process chart was created using Google Colab with the below code. I cheated slightly by removing the labels in Python and re-including them as a text box in Google Slides, because I wanted to have text in different colours which is tricky to do in Python.

Code ->>

# Import library
import matplotlib.pyplot as plt
from matplotlib.sankey import Sankey

# Set up chart
fig = plt.figure(figsize=(40, 40))
ax = fig.add_subplot(1, 1, 1, xticks=[], yticks=[], title="")
sankey = Sankey(ax=ax, scale=.05, offset=0.3, unit='')

# Stage 1 - Applicants -> Passed 1st Interview
sankey.add(flows=[50, -12, -38],
           labels=['', '', ''],
           # Rotate arrows for candidates leaving the process
           orientations=[0, 0, -1],
           trunklength=3,
           edgecolor='#323232',
           facecolor='#323232')

# Stage 2 - Passed 1st Interview -> Outcomes
sankey.add(flows=[12, -1, -8, -3],
           labels=['', '', '', ''],
           trunklength=1,
           pathlengths=[2, 2, 2, 2],
           orientations=[0, 1, 0, -1],
           # Link Stage 1 & Stage 2
           prior=0,
           connect=(1, 0),
           edgecolor='#323232',
           facecolor='#ff96c9')
diagrams = sankey.finish()

# Set text size
for diagram in diagrams:
    for text in diagram.texts:
        text.set_fontsize(16)
