<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:wfw="http://wellformedweb.org/CommentAPI/" xmlns:dc="http://purl.org/dc/elements/1.1/" >

<channel><title><![CDATA[London SODA - Articles]]></title><link><![CDATA[https://www.londonsoda.com/articles]]></link><description><![CDATA[Articles]]></description><pubDate>Mon, 26 Feb 2024 15:30:36 +0000</pubDate><generator>Weebly</generator><item><title><![CDATA[data warehousing]]></title><link><![CDATA[https://www.londonsoda.com/articles/data-warehousing]]></link><comments><![CDATA[https://www.londonsoda.com/articles/data-warehousing#comments]]></comments><pubDate>Mon, 24 Feb 2020 00:00:00 GMT</pubDate><category><![CDATA[Uncategorized]]></category><guid isPermaLink="false">https://www.londonsoda.com/articles/data-warehousing</guid><description><![CDATA[ 	 		 			 				 					 						       					 								 					 						  OverviewUnlike many misleading tech pseudonyms (looking at you &lsquo;growth hacking&rsquo;), data warehousing is a really good name:Warehouse: a building that collects materials, stores and packages them in a sensible order and sends the materials off to other parts of the business when requiredData warehouse: a computer system that collects, stores, processes and outputs data&#8203;Although data warehousing sounds technical (and po [...] 
]]></description><content:encoded><![CDATA[<div><div class="wsite-multicol"><div class="wsite-multicol-table-wrap" style="margin:0 -15px;"> 	<table class="wsite-multicol-table"> 		<tbody class="wsite-multicol-tbody"> 			<tr class="wsite-multicol-tr"> 				<td class="wsite-multicol-col" style="width:25.300399909804%; padding:0 15px;"> 					 						  <div><div style="height: 20px; overflow: hidden; width: 100%;"></div> <hr class="styled-hr" style="width:100%;"></hr> <div style="height: 20px; overflow: hidden; width: 100%;"></div></div>   					 				</td>				<td class="wsite-multicol-col" style="width:47.179024421971%; padding:0 15px;"> 					 						  <div class="paragraph"><span><span style="color:rgb(0, 0, 0); font-weight:700">Overview</span></span><br /><br /><span><span style="color:rgb(0, 0, 0)">Unlike many misleading tech pseudonyms (looking at you &lsquo;growth hacking&rsquo;), data warehousing is a really good name:</span></span><br /><br /><ul><li><span><span style="color:rgb(0, 0, 0)"><span><span style="font-weight:700">Warehouse</span><span>: </span><span>a building that collects materials, stores and packages them in a sensible order and sends the materials off to other parts of the business when required</span></span></span></span></li><li><span><span style="font-weight:700">Data warehouse:</span><span> </span><span>a computer system that collects, stores, processes and outputs data</span></span></li></ul>&#8203;<br /><span><span style="color:rgb(0, 0, 0)">Although data warehousing sounds technical (and possibly complicated), if you have ever built a spreadsheet that records data in a sensible order, you have made a simple data warehouse.</span></span><br /><br /><span><span style="color:rgb(0, 0, 0)">&#8203;However, modern day business information is vast, complex and heterogeneous. 
More sophisticated systems are needed to bring order to the complexity and allow for data to be analysed.</span></span><br /><br /><span><span style="color:rgb(0, 0, 0); font-weight:700">Stages of Data Warehouses</span></span><br /><br /><span><span style="color:rgb(0, 0, 0)">Data warehouses vary in how they are built, largely depending on the requirements of the final output, but every data warehouse should share some key elements. The key parts of a data warehouse are:</span></span><br /><br /><ul><li><span><span style="color:rgb(0, 0, 0)"><span><span style="font-weight:700">Data Sources:</span><span> Data can come from internal sources, such as daily production or sales information, or from external sources, such as weather reports or FX rates.</span></span></span></span></li><li><span><span style="font-weight:700">Staging Area: </span><span>Data is downloaded in a raw format, cleansed, divided and stored temporarily, ready to be loaded into storage.</span></span></li><li><span><span style="font-weight:700">Storage: </span><span>Cleansed data is stored, often in relational databases, where information is held in separate tables that are linked via a common field. For example, a retail business might keep store income and store size in two different tables, which can be linked together using store ID. Relational databases can be queried with SQL, which keeps the data easy to access.</span></span></li><li><span><span style="font-weight:700">Data Marts:</span><span> Data is organised by topic, e.g. sales, marketing, production information, for ease of retrieval and to help users find the information they need.</span></span></li><li><span><span style="color:rgb(0, 0, 0); font-weight:700">Output:</span><span style="color:rgb(0, 0, 0)"> Reports, dashboards, visualisations. Any sort of output that requires data. 
Where possible, outputs can be linked directly to the Data Marts or to Storage for automated reporting and easy updating.</span></span></li></ul></div>  <div><div class="wsite-image wsite-image-border-none " style="padding-top:10px;padding-bottom:10px;margin-left:0;margin-right:0;text-align:center"> <a> <img src="https://www.londonsoda.com/uploads/9/5/6/2/95626400/data-warehouse_orig.jpg" alt="Picture" style="width:auto;max-width:100%" /> </a> <div style="display:block;font-size:90%"></div> </div></div>  <div class="paragraph"><br /><span><span style="color:rgb(0, 0, 0); font-weight:700">Infrastructure Types</span></span><br /><br /><span><span style="color:rgb(0, 0, 0)">There are two types of infrastructure used to build a data warehouse:</span></span><br /><br /><ul><li><span style="color:rgb(0, 0, 0); font-weight:700">On-prem</span><span style="color:rgb(0, 0, 0); font-weight:lighter">: Data is accessed and stored via software and hardware located on the premises of the business. E.g. you have software on your laptop to access data, the data is stored on a hard drive located in the office and requests are processed by a big server located in the basement.</span></li><li><span><span style="font-weight:700">Cloud</span><span>: Data and requests for data are transferred over the internet. The hardware which stores and processes the data can be located anywhere in the world. E.g. 
you use your web browser to send a request (via the internet) to access the data, the request is processed by a server in a data centre in London and the data is retrieved and sent from data stored in another data centre in Manchester.</span></span></li></ul><br /><span><span style="color:rgb(0, 0, 0)">Systems are typically set up to be </span><span style="color:rgb(0, 0, 0)">either </span><span style="color:rgb(0, 0, 0)">on-prem </span><span style="color:rgb(0, 0, 0)">or</span><span style="color:rgb(0, 0, 0)"> cloud, with lots of businesses taking on IT infrastructure projects to move their systems away from on-prem to cloud. However, there is a growing trend of hybrid solutions where a mixture of on-prem </span><span style="color:rgb(0, 0, 0)">and</span><span style="color:rgb(0, 0, 0)"> cloud is used, for reasons including cost, security and speed.</span></span><br /><br /><span><span style="color:rgb(0, 0, 0); font-weight:700">Benefits of Data Warehousing</span></span><ul><li style="color:rgb(0, 0, 0)"><span><span style="font-weight:700">Quality:</span><span> Everyone knows what &lsquo;messy&rsquo; data looks like. Storing and processing data in an orderly manner can increase the quality of data immensely. E.g. poor quality data might store the same customer as three separate entries due to inconsistent formatting (e.g. 1) A Nother, 2) A. Nother, 3) Nother, A). A proper data warehouse process might automatically cleanse these names and store them as one customer.</span></span></li><li style="color:rgb(0, 0, 0)">&#8203;<span><span style="font-weight:700">Consistency: </span><span>A problem I often encounter is businesses struggling to collate data across different sources, e.g. sales data and revenue data. A typical problem is being unable to link the bookings in the CRM system to revenue, so it is very difficult to understand how sales translate to revenue and cash. 
A data warehouse might identify sales bookings with a unique identifier which is then consistently used across invoicing and revenue systems.</span></span></li><li style="color:rgb(0, 0, 0)"><span><span style="font-weight:700">Timeliness of analysis: </span><span>Many finance teams I have encountered spend 50%+ of their time preparing formulaic monthly reports, and often month-end reports are finalised halfway through the following month. A data warehouse can distill that process from weeks to hours by creating a process that automatically prepares the reports in the required format. The key benefit of this is that decisions can be made much more quickly and while the data is at its most relevant.</span></span></li><li style="color:rgb(0, 0, 0)"><span><span style="font-weight:700">Volume:</span><span> Regular Microsoft Excel has a maximum capacity of 1,048,576 rows. This might seem like a lot, but this is increasingly (in my opinion) becoming a limiting factor in data storage. One example is a stock broker business that required analysis on their transactional data. The data set that needed analysing was over 5 billion rows! To store and analyse that amount of data requires a proper data warehousing process.</span></span></li></ul><br /><span><span style="color:rgb(0, 0, 0); font-weight:700">Challenges</span></span><br /><br /><ul><li style="color:rgb(0, 0, 0)"><span><span style="font-weight:700">Cost: </span><span>Creating a data warehousing system requires more infrastructure than a disorganised process. This infrastructure, e.g. 
cloud storage, has costs which can become substantial.</span></span></li><li style="color:rgb(0, 0, 0)"><span><span style="font-weight:700">Complexity:</span><span> Building a high-quality data warehouse is a highly skilled job and some businesses do not have employees with the right skillset to build and maintain data warehouses.</span></span></li><li style="color:rgb(0, 0, 0)"><span><span style="font-weight:700">Technical debt: </span><span>Poorly built data warehouses can lead to legacy issues, such as the volume of data outgrowing the data warehouse&rsquo;s infrastructure. The effort to patch, upgrade or rebuild data warehouses can be problematic if they have not been built for the long term.</span></span></li><li style="color:rgb(0, 0, 0)"><span><span style="font-weight:700">High Latency (slow speeds):</span><span> More recently, I have seen data pipelines struggle with the volume of data being processed. For static monthly reporting (e.g. downloading a board pack), this isn&rsquo;t a major issue. However, if the data warehouse is feeding </span></span><span><span>a dynamic tool to make decisions in real-time, a 10-second delay to load up a new cut of data is a problem.</span></span></li></ul><br /><span><span style="color:rgb(0, 0, 0); font-weight:700">Summary</span></span><br /><br /><span><span style="color:rgb(0, 0, 0)">Data warehousing is fundamental to reporting and data analytics. Although the concepts of cleansing, organising and storing data will be fairly obvious to a lot of people, the best practices of data warehousing and how all the processes fit together are less well understood.</span></span><br /><br /><span><span style="color:rgb(0, 0, 0)">Data warehouses are computer systems that collect, store, process and output data. 
Data warehouses are often split into five stages: 1) Data sources, 2) Staging area, 3) Storage, 4) Data marts and 5) Output, although there are lots of different methodologies and terminologies.</span></span><br /><br /><span><span style="color:rgb(0, 0, 0)">The benefits of good data warehousing are significant and, from personal experience, can transform how middle &amp; back office teams function by reducing reporting processes from weeks to minutes. The more exciting and valuable benefit of data warehousing is that it enables in-depth and real-time analysis of businesses that facilitates faster, more accurate and more insightful decision making.</span></span></div>   					 				</td>				<td class="wsite-multicol-col" style="width:27.520575668226%; padding:0 15px;"> 					 						  <div><div style="height: 20px; overflow: hidden; width: 100%;"></div> <hr class="styled-hr" style="width:100%;"></hr> <div style="height: 20px; overflow: hidden; width: 100%;"></div></div>   					 				</td>			</tr> 		</tbody> 	</table> </div></div></div>]]></content:encoded></item><item><title><![CDATA[The Sankey Diagram]]></title><link><![CDATA[https://www.londonsoda.com/articles/the-sankey-diagram]]></link><comments><![CDATA[https://www.londonsoda.com/articles/the-sankey-diagram#comments]]></comments><pubDate>Wed, 12 Feb 2020 09:44:03 GMT</pubDate><category><![CDATA[Uncategorized]]></category><guid isPermaLink="false">https://www.londonsoda.com/articles/the-sankey-diagram</guid><description><![CDATA[																																													Introduction&#8203;Sankey Diagrams are used to visualise flows, processes and aggregated numbers. They are especially useful when breaking down a number into its component parts. The Sankey Diagram was first used by Matthew Sankey in 1898 to illustrate the flow of energy in a steam engine system. He wanted to visualise the energy efficiency of a steam combustion system by taking the input energy and sketching out where the energy went. 
&#8203;Bearing  [...] ]]></description><content:encoded><![CDATA[<div><div class="wsite-multicol"><div class="wsite-multicol-table-wrap" style="margin:0 -15px;">	<table class="wsite-multicol-table">		<tbody class="wsite-multicol-tbody">			<tr class="wsite-multicol-tr">				<td class="wsite-multicol-col" style="width:21.910695742471%; padding:0 15px;">											<div><div style="height: 20px; overflow: hidden; width: 100%;"></div><hr class="styled-hr" style="width:100%;"></hr><div style="height: 20px; overflow: hidden; width: 100%;"></div></div>									</td>				<td class="wsite-multicol-col" style="width:50.950411566164%; padding:0 15px;">											<div class="paragraph"><span><span style="font-weight:700">Introduction<br /><br />&#8203;</span></span><span><span style="color:rgb(0, 0, 0)">Sankey Diagrams are used to visualise flows, processes and aggregated numbers. They are especially useful when breaking down a number into its component parts. The Sankey Diagram was first used by Matthew Sankey in 1898 to illustrate the flow of energy in a steam engine system. He wanted to visualise the energy efficiency of a steam combustion system by taking the input energy and sketching out where the energy went. 
</span><span>&#8203;</span></span><br /></div><div><div class="wsite-image wsite-image-border-none " style="padding-top:0px;padding-bottom:10px;margin-left:0px;margin-right:0px;text-align:center"><a><img src="https://www.londonsoda.com/uploads/9/5/6/2/95626400/sankey_orig.jpg" alt="Picture" style="width:auto;max-width:100%" /></a><div style="display:block;font-size:90%"></div></div></div><div class="paragraph"><span><span style="color:rgb(0, 0, 0)">Bearing in mind the first law of thermodynamics (the conservation of energy), this chart shows the energy usage in its entirety: all of the outputs of the system add up to the input(s) to the system.</span></span><br /></div><div><div class="wsite-image wsite-image-border-none " style="padding-top:0px;padding-bottom:10px;margin-left:0px;margin-right:0px;text-align:center"><a><img src="https://www.londonsoda.com/uploads/9/5/6/2/95626400/example-sankey_orig.png" alt="Picture" style="width:auto;max-width:100%" /></a><div style="display:block;font-size:90%"></div></div></div><div class="paragraph"><span><span style="color:rgb(0, 0, 0)"></span></span><span style="color:rgb(0, 0, 0)">Sankey Diagrams are primarily used in physics, particularly to visualise energy systems. However, with a bit of creative thinking, Sankey Diagrams can also be used to powerfully illustrate concepts in finance and operations.<br />&#8203;</span><span><span style="color:rgb(0, 0, 0)"><br />A Sankey Diagram might be helpful if you are trying to show the composition or aggregation of a number, or if you are trying to link two sets of numbers which add up to the same total but are segmented differently (e.g. 
breakdown of costs by country and breakdown of costs by type).</span></span></div><div class="paragraph"><span><span style="color:rgb(0, 0, 0); font-weight:700">Example 1: P&amp;L Diagram<br /><br /></span></span><span><span style="color:rgb(0, 0, 0)">Sankey Diagrams can illustrate the breakdown of a P&amp;L, and how revenue is &lsquo;used&rsquo; by costs in the business to arrive at net profit. Although a slightly abstract concept, this can really help to understand the orders of magnitude of costs at different levels of a business. The chart below separates COGS, overheads and depreciation / finance costs to understand costs at the gross profit, operating profit, EBITDA and net profit levels.</span></span><br /><span></span></div><div><div class="wsite-image wsite-image-border-none " style="padding-top:0px;padding-bottom:10px;margin-left:0px;margin-right:0px;text-align:center"><a><img src="https://www.londonsoda.com/uploads/9/5/6/2/95626400/sankey_orig.jpg" alt="Picture" style="width:auto;max-width:100%" /></a><div style="display:block;font-size:90%"></div></div></div><div class="paragraph"><span><span style="color:rgb(0, 0, 0)">An alternative diagram could show cash inflows and outflows in a business, with cash surplus on the right-hand side of the diagram. If there is a cash flow deficit, this could be included on the side of cash inflows, to represent cash usage from cash reserves rather than cash inflows during the period.<br /><br /></span></span><span><span style="color:rgb(0, 0, 0); font-weight:700">Example 2: Interview Process<br /><br /></span></span><span><span style="color:rgb(0, 0, 0)">Processes can be tricky to visualise in an intuitive way. 
Sankey Diagrams can show what&rsquo;s gone where during a process, which is especially useful in multi-stage processes where the inputs can go down a variety of output paths.<br /><br /></span></span><span><span style="color:rgb(0, 0, 0)">One example of this is an interview process, where you have a homogeneous input (applicants) and a variety of output paths that the applicant can go down, from falling at the first hurdle to finally accepting an offer. With many different possible outcomes, it can be difficult to visualise the process as a whole and understand what has happened to the applicants. Sankey Diagrams are a powerful way of showing a process in its entirety and giving a sense of scale.</span></span><br /><span></span></div><div><div class="wsite-image wsite-image-border-none " style="padding-top:0px;padding-bottom:10px;margin-left:0px;margin-right:0px;text-align:center"><a><img src="https://www.londonsoda.com/uploads/9/5/6/2/95626400/interview-sankey_orig.jpg" alt="Picture" style="width:auto;max-width:100%" /></a><div style="display:block;font-size:90%"></div></div></div><div class="paragraph"><span><span style="color:rgb(0, 0, 0); font-weight:700">Summary<br /><br /></span></span><span><span style="color:rgb(0, 0, 0)">Sankey Diagrams are a different breed of visualisation to the standard line, bar and pie charts, and they can be used to create very intuitive diagrams. The best use cases for Sankey Diagrams are numbers that require breaking down into their component parts or visualising processes with a numerical element. The key concept to remember is that the inputs and outputs of a Sankey Diagram must be equal, in the same way that energy is conserved under the first law of thermodynamics. 
Although these diagrams are normally used in physics and energy flow diagrams, with a little bit of creative thinking they can be applied to a much broader range of topics.<br /><br /></span></span><span><span style="color:rgb(0, 0, 0); font-weight:700">Coder&rsquo;s Corner<br /><br /></span></span><span><span style="color:rgb(0, 0, 0)">Creating Sankey Diagrams is a bit tricky, mainly because they are not standard visualisations that are readily available in common software. I think that the easiest way to create simple Sankey Diagrams is in Microsoft Power BI (the P&amp;L diagram was created using this software), although this software has limited customisation options for Sankey Diagrams.<br /><br /></span></span><span><span style="color:rgb(0, 0, 0)">If you are desperate, you can create illustrative Sankey Diagrams just using the shapes in PowerPoint or Google Slides, although the scales and sizes will probably be incorrect and creating the chart might be time-consuming.<br /><br /></span></span><span><span style="color:rgb(0, 0, 0)">If you are more confident in Python, matplotlib includes Sankey Diagrams (via matplotlib.sankey) that can be customised and joined together. The interview process chart was created using Google Colab with the code below. 
I cheated slightly by removing the labels in Python and re-including them as a text box in Google Slides, because I wanted to have text in different colours which is tricky to do in Python.<br /><br /></span></span><span><span style="font-weight:700">&#8203;Code -&gt;&gt;</span></span><br /><span></span></div><div id="195138403386225716"><div><style type="text/css">	#element-1a07872b-0b9d-48af-ba08-a311e805d6c8 .colored-box-content {  clear: both;  float: left;  width: 100%;  -moz-box-sizing: border-box;  -webkit-box-sizing: border-box;  -ms-box-sizing: border-box;  box-sizing: border-box;  background-color: #f4f7f8;  padding-top: 20px;  padding-bottom: 20px;  padding-left: 20px;  padding-right: 20px;  -webkit-border-top-left-radius: 0px;  -moz-border-top-left-radius: 0px;  border-top-left-radius: 0px;  -webkit-border-top-right-radius: 0px;  -moz-border-top-right-radius: 0px;  border-top-right-radius: 0px;  -webkit-border-bottom-left-radius: 0px;  -moz-border-bottom-left-radius: 0px;  border-bottom-left-radius: 0px;  -webkit-border-bottom-right-radius: 0px;  -moz-border-bottom-right-radius: 0px;  border-bottom-right-radius: 0px;}</style><div id="element-1a07872b-0b9d-48af-ba08-a311e805d6c8" data-platform-element-id="848857247979793891-1.0.1" class="platform-element-contents">	<div class="colored-box">    <div class="colored-box-content">        <div style="width: auto"><div></div><div class="paragraph"><span><span style="color:rgb(0, 128, 0); font-weight:700"># Import library</span></span><br /><strong><span><span style="color:rgb(175, 0, 219); font-weight:700">import</span><span style="font-weight:700"> matplotlib.pyplot </span><span style="color:rgb(175, 0, 219); font-weight:700">as</span><span style="font-weight:700"> plt</span></span><br /><span><span style="color:rgb(175, 0, 219); font-weight:700">from</span><span style="font-weight:700"> matplotlib.sankey </span><span style="color:rgb(175, 0, 219); font-weight:700">import</span><span style="font-weight:700"> 
Sankey</span></span></strong><br /><br /><span><span style="color:rgb(0, 128, 0); font-weight:700"># Set up chart</span></span><br /><span><span style="font-weight:700">fig = plt.figure(figsize = (</span><span style="color:rgb(9, 136, 90); font-weight:700">40</span><span style="font-weight:700">,</span><span style="color:rgb(9, 136, 90); font-weight:700">40</span><span style="font-weight:700">))</span></span><br /><span><span style="font-weight:700">ax = fig.add_subplot(</span><span style="color:rgb(9, 136, 90); font-weight:700">1</span><span style="font-weight:700">, </span><span style="color:rgb(9, 136, 90); font-weight:700">1</span><span style="font-weight:700">, </span><span style="color:rgb(9, 136, 90); font-weight:700">1</span><span style="font-weight:700">, xticks=[], yticks=[], title=</span><span style="color:rgb(163, 21, 21); font-weight:700">""</span><span style="font-weight:700">)</span></span><br /><span><span style="font-weight:700">sankey = Sankey(ax=ax, scale=</span><span style="color:rgb(9, 136, 90); font-weight:700">.05</span><span style="font-weight:700">, offset=</span><span style="color:rgb(9, 136, 90); font-weight:700">0.3</span><span style="font-weight:700">, unit=</span><span style="color:rgb(163, 21, 21); font-weight:700">''</span><span style="font-weight:700">)</span></span><br /><br /><span><span style="color:rgb(0, 128, 0); font-weight:700"># Stage 1 - Applicants -&gt; Passed 1st Interview</span></span><br /><span><span style="font-weight:700">sankey.add(flows=[</span><span style="color:rgb(9, 136, 90); font-weight:700">50</span><span style="font-weight:700">, </span><span style="color:rgb(9, 136, 90); font-weight:700">-12</span><span style="font-weight:700">, </span><span style="color:rgb(9, 136, 90); font-weight:700">-38</span><span style="font-weight:700">],</span></span><br /><span><span style="font-weight:700">labels=[</span><span style="color:rgb(163, 21, 21); font-weight:700">''</span><span style="font-weight:700">, </span><span 
style="color:rgb(163, 21, 21); font-weight:700">''</span><span style="font-weight:700">, </span><span style="color:rgb(163, 21, 21); font-weight:700">''</span><span style="font-weight:700">],</span></span><br /><br /><span><span style="color:rgb(0, 128, 0); font-weight:700"># Rotate arrows for candidates leaving the process</span></span><br /><span><span style="font-weight:700">orientations=[</span><span style="color:rgb(9, 136, 90); font-weight:700">0</span><span style="font-weight:700">, </span><span style="color:rgb(9, 136, 90); font-weight:700">0</span><span style="font-weight:700">,</span><span style="color:rgb(9, 136, 90); font-weight:700">-1</span><span style="font-weight:700">],</span></span><br /><span><span style="font-weight:700">trunklength = </span><span style="color:rgb(9, 136, 90); font-weight:700">3</span><span style="font-weight:700">,</span></span><br /><span><span style="font-weight:700">edgecolor = </span><span style="color:rgb(163, 21, 21); font-weight:700">'#323232'</span><span style="font-weight:700">,</span></span><br /><span><span style="font-weight:700">facecolor = </span><span style="color:rgb(163, 21, 21); font-weight:700">'#323232'</span><span style="font-weight:700">)</span></span><br /><br /><span><span style="color:rgb(0, 128, 0); font-weight:700"># Stage 2 - Passed 1st Interview -&gt; Outcomes</span></span><br /><span><span style="font-weight:700">sankey.add(flows=[</span><span style="color:rgb(9, 136, 90); font-weight:700">12</span><span style="font-weight:700">, </span><span style="color:rgb(9, 136, 90); font-weight:700">-1</span><span style="font-weight:700">, </span><span style="color:rgb(9, 136, 90); font-weight:700">-8</span><span style="font-weight:700">, </span><span style="color:rgb(9, 136, 90); font-weight:700">-3</span><span style="font-weight:700">],</span></span><br /><span><span style="font-weight:700">labels=[</span><span style="color:rgb(163, 21, 21); font-weight:700">''</span><span 
style="font-weight:700">,</span><span style="color:rgb(163, 21, 21); font-weight:700">''</span><span style="font-weight:700">, </span><span style="color:rgb(163, 21, 21); font-weight:700">''</span><span style="font-weight:700">, </span><span style="color:rgb(163, 21, 21); font-weight:700">''</span><span style="font-weight:700">],</span></span><br /><span><span style="font-weight:700">trunklength = </span><span style="color:rgb(9, 136, 90); font-weight:700">1</span><span style="font-weight:700">,</span></span><br /><span><span style="font-weight:700">pathlengths = [</span><span style="color:rgb(9, 136, 90); font-weight:700">2</span><span style="font-weight:700">,</span><span style="color:rgb(9, 136, 90); font-weight:700">2</span><span style="font-weight:700">,</span><span style="color:rgb(9, 136, 90); font-weight:700">2</span><span style="font-weight:700">,</span><span style="color:rgb(9, 136, 90); font-weight:700">2</span><span style="font-weight:700">],</span></span><br /><span><span style="font-weight:700">orientations=[</span><span style="color:rgb(9, 136, 90); font-weight:700">0</span><span style="font-weight:700">, </span><span style="color:rgb(9, 136, 90); font-weight:700">1</span><span style="font-weight:700">, </span><span style="color:rgb(9, 136, 90); font-weight:700">0</span><span style="font-weight:700">, </span><span style="color:rgb(9, 136, 90); font-weight:700">-1</span><span style="font-weight:700">],</span></span><br /><br /><span><span style="color:rgb(0, 128, 0); font-weight:700">#Link Stage 1 &amp; Stage 2</span></span><br /><span><span style="font-weight:700">prior=</span><span style="color:rgb(9, 136, 90); font-weight:700">0</span><span style="font-weight:700">,</span></span><br /><span><span style="font-weight:700">connect=(</span><span style="color:rgb(9, 136, 90); font-weight:700">1</span><span style="font-weight:700">, </span><span style="color:rgb(9, 136, 90); font-weight:700">0</span><span style="font-weight:700">),</span></span><br 
/><span><span style="font-weight:700">edgecolor = </span><span style="color:rgb(163, 21, 21); font-weight:700">'#323232'</span><span style="font-weight:700">,</span></span><br /><span><span style="font-weight:700">facecolor = </span><span style="color:rgb(163, 21, 21); font-weight:700">'#ff96c9'</span><span style="font-weight:700">)</span></span><br /><span><span style="font-weight:700">diagrams = sankey.finish()</span></span><br /><br /><span><span style="color:rgb(0, 128, 0); font-weight:700"># Set text size</span></span><br /><span><span style="color:rgb(175, 0, 219); font-weight:700">for</span><span style="font-weight:700"> diagram </span><span style="color:rgb(0, 0, 255); font-weight:700">in</span><span style="font-weight:700"> diagrams:</span></span><br /><span><span style="font-weight:700">&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="color:rgb(175, 0, 219); font-weight:700">for</span><span style="font-weight:700"> text </span><span style="color:rgb(0, 0, 255); font-weight:700">in</span><span style="font-weight:700"> diagram.texts:</span></span><br /><span><span style="font-weight:700">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;text.set_fontsize(</span><span style="color:rgb(9, 136, 90); font-weight:700">16</span><span style="font-weight:700">);</span></span></div></div>    </div></div></div><div style="clear:both;"></div></div></div>									</td>				<td class="wsite-multicol-col" style="width:27.138892691365%; padding:0 15px;">											<div><div style="height: 20px; overflow: hidden; width: 100%;"></div><hr class="styled-hr" style="width:100%;"></hr><div style="height: 20px; overflow: hidden; width: 100%;"></div></div>									</td>			</tr>		</tbody>	</table></div></div></div>]]></content:encoded></item><item><title><![CDATA[Violin plots]]></title><link><![CDATA[https://www.londonsoda.com/articles/violin-plots]]></link><comments><![CDATA[https://www.londonsoda.com/articles/violin-plots#comments]]></comments><pubDate>Tue, 29 Oct 2019 21:26:46 
GMT</pubDate><category><![CDATA[Uncategorized]]></category><guid isPermaLink="false">https://www.londonsoda.com/articles/violin-plots</guid><description><![CDATA[ 	 		 			 				 					 						       					 								 					 						  IntroductionViolin plots are used to intuitively show the distribution of data in a data set. If you want to understand things like the demographics of your product users or&nbsp;the range of revenue per customer across different product ranges, violin plots might be the thing for you.This article firstly covers how to interpret box plots, before setting out how to understand distribution using violin plots. Finally, it shows how to  [...] ]]></description><content:encoded><![CDATA[<div><div class="wsite-multicol"><div class="wsite-multicol-table-wrap" style="margin:0 -15px;"> 	<table class="wsite-multicol-table"> 		<tbody class="wsite-multicol-tbody"> 			<tr class="wsite-multicol-tr"> 				<td class="wsite-multicol-col" style="width:28.348909657321%; padding:0 15px;"> 					 						  <div><div style="height: 20px; overflow: hidden; width: 100%;"></div> <hr class="styled-hr" style="width:100%;"></hr> <div style="height: 20px; overflow: hidden; width: 100%;"></div></div>   					 				</td>				<td class="wsite-multicol-col" style="width:44.716935917782%; padding:0 15px;"> 					 						  <div class="paragraph"><span><span style="color:rgb(0, 0, 0); font-weight:700">Introduction</span></span><br /><br />Violin plots are used to intuitively show the distribution of data in a data set. If you want to understand things like the demographics of your product users or&nbsp;<span>the range of revenue per customer across different product ranges, violin plots might be the thing for you.</span><br /><br /><span><span style="color:rgb(0, 0, 0)">This article firstly covers how to interpret box plots, before setting out how to understand distribution using violin plots. 
Finally, it shows how to use the &lsquo;split violin plot&rsquo; to reveal a wealth of information about a data set at a single glance.</span></span><br /></div>  <div><div class="wsite-image wsite-image-border-none " style="padding-top:10px;padding-bottom:10px;margin-left:0;margin-right:0;text-align:center"> <a> <img src="https://www.londonsoda.com/uploads/9/5/6/2/95626400/split-violinplot-1_orig.png" alt="Picture" style="width:auto;max-width:100%" /> </a> <div style="display:block;font-size:90%"></div> </div></div>  <div class="paragraph"><span><span style="color:rgb(0, 0, 0)">The data used in the examples below is completely fictional and is about the social media followers of a retail company. The retail company owners want to understand a bit more about the ages of their social media followers across different channels. However, they are struggling to intuitively understand the large volume of data associated with their X00,000s social media followers.</span></span><br /><br /><span><span style="color:rgb(0, 0, 0)">The retail company promote their brand using i) Instagram, ii) Facebook and iii) Twitter and want to optimise their content for each channel based on the age groups of their current follower base. The main questions they want to answer are:</span></span><br /><br /><ol><li style="color:rgb(0, 0, 0)">What is the age distribution of our followers for each channel?</li><li style="color:rgb(0, 0, 0)">How does that compare across channels?</li></ol><br /><span><span style="color:rgb(0, 0, 0); font-weight:700">Box plots</span></span><br /><br /><span><span style="color:rgb(0, 0, 0)">A great way to show the range of a numerical variable, such as age, is to plot the data in a box plot (also called a &lsquo;box-and-whisker&rsquo; chart). 
Box plots show the distribution of a numerical variable and are useful for showing whether the data points in the data set are tightly grouped or spread out and what the range of the data set is.</span></span></div>  <div><div class="wsite-image wsite-image-border-none " style="padding-top:10px;padding-bottom:10px;margin-left:0;margin-right:0;text-align:center"> <a> <img src="https://www.londonsoda.com/uploads/9/5/6/2/95626400/boxplot-diagram_orig.png" alt="Picture" style="width:auto;max-width:100%" /> </a> <div style="display:block;font-size:90%"></div> </div></div>  <div class="paragraph"><span><span style="color:rgb(0, 0, 0)">This chart is most effective when different categories of things are plotted on the same chart and can be compared. In the example of the retail business that wants to understand its followers across different social media channels, the owners can use box plots to plot the information together and gain insights from comparing the different channels, rather than reviewing each channel in isolation.</span></span></div>  <div><div class="wsite-image wsite-image-border-none " style="padding-top:10px;padding-bottom:10px;margin-left:0;margin-right:0;text-align:center"> <a> <img src="https://www.londonsoda.com/uploads/9/5/6/2/95626400/boxplot_orig.png" alt="Picture" style="width:auto;max-width:100%" /> </a> <div style="display:block;font-size:90%"></div> </div></div>  <div class="paragraph"><font size="2"><span style="color:rgb(0, 0, 0); font-weight:700">Violin plots</span><br /><br /><span style="color:rgb(0, 0, 0)">While box plots are highly effective and widely used in data analytics, they are limited in that they show only specific statistical points, such as the median or outliers, rather than the distribution of a data set as a whole.</span><br /><br /><span style="color:rgb(0, 0, 0)">Violin plots focus on illustrating the distribution of the entire data set and can generate different insights that are hidden in the structure of box 
plots.</span></font></div>  <div><div class="wsite-image wsite-image-border-none " style="padding-top:10px;padding-bottom:10px;margin-left:0;margin-right:0;text-align:center"> <a> <img src="https://www.londonsoda.com/uploads/9/5/6/2/95626400/violinplot-diagram_orig.png" alt="Picture" style="width:auto;max-width:100%" /> </a> <div style="display:block;font-size:90%"></div> </div></div>  <div class="paragraph"><span><span style="color:rgb(0, 0, 0)">Although box plots are an intuitive way of understanding statistical metrics, such as interquartile range, outliers and median average, violin plots give a complete overview of a data set. Box plots are essentially summaries, meaning that the underlying distribution of the data driving the statistical metrics is obscured. <br /><br />&#8203;Comparatively, violin plots will give a complete overview of the distribution of data, which is especially powerful when comparing different categories within a data set, such as splitting data across seven charts to compare days of the week.&nbsp;<br /><br /></span></span><span><span style="color:rgb(0, 0, 0)">This is illustrated below, again with the example of a retail business looking to understand the ages of their social media followers. From the chart, you can quickly build intuitions about the age distributions across the channels.</span></span><br /><span></span></div>  <div><div class="wsite-image wsite-image-border-none " style="padding-top:10px;padding-bottom:10px;margin-left:0;margin-right:0;text-align:center"> <a> <img src="https://www.londonsoda.com/uploads/9/5/6/2/95626400/violinplot_orig.png" alt="Picture" style="width:auto;max-width:100%" /> </a> <div style="display:block;font-size:90%"></div> </div></div>  <div class="paragraph"><span><span style="color:rgb(0, 0, 0)">Although these charts look downright weird at first, they can sometimes be a more intuitive way of understanding the distribution of data points in a data set. 
The two main advantages of basic violin plots are:</span></span><ol><li style="color:rgb(0, 0, 0)"><span><span style="font-weight:700">Clusters</span><span> - Violin plots can reveal unusual groups of data points. In the example above, the Instagram channel has a concentration of users below the interquartile range ("IQR"), aged 0-10, versus a wide spread of users above the IQR, aged over 30. This information is more difficult to pick out of the box plot above.</span></span></li><li style="color:rgb(0, 0, 0)"><span><span style="font-weight:700">Quick &amp; powerful insights</span><span> - Fundamentally, the purpose of charts is to provide insights more intuitively and more quickly than looking at raw numbers. Once you get used to them, violin plots can give a bird&rsquo;s-eye view of an entire dataset at a glance, especially when used with multiple categories for comparisons.</span></span></li></ol><br /><span><span style="color:rgb(0, 0, 0); font-weight:700">&#8203;Next level violin plots: The Split Violin Plot</span></span><br /><br /><span><span style="color:rgb(0, 0, 0)">The hidden power of violin plots is that they can be split across an additional category to give an extra level of comparative analysis. This is a distinctive feature of violin plots that allows for particularly useful insights and, in the right scenario, creates an extremely intuitive way of explaining complicated patterns in a data set.</span></span><br /><br /><span><span style="color:rgb(0, 0, 0)">In the example below, the social media follower data is further split by sex of follower. As can be seen, this shows how the age distributions of male and female followers vary across channels. 
This additional split can only be used with a binary variable, i.e. one with exactly two categories (e.g. yes/no).</span></span></div>  <div><div class="wsite-image wsite-image-border-none " style="padding-top:10px;padding-bottom:10px;margin-left:0;margin-right:0;text-align:center"> <a> <img src="https://www.londonsoda.com/uploads/9/5/6/2/95626400/split-violinplot_1_orig.png" alt="Picture" style="width:auto;max-width:100%" /> </a> <div style="display:block;font-size:90%"></div> </div></div>  <div class="paragraph"><span><span style="color:rgb(0, 0, 0); font-weight:700">Conclusion<br /><br />&#8203;</span></span><span><span style="color:rgb(0, 0, 0)">Violin plots are very handy to have in the data visualisation toolbox. They are highly effective in showing the distribution of data points in a data set in a clear and intuitive way, and are particularly useful when used to compare different categories of data points. Violin plots can be turbocharged by being split across a yes/no variable to give even greater insight.</span></span><br /><br /><span><span style="color:rgb(0, 0, 0); font-weight:700">Coder&rsquo;s corner</span></span><br /><br /><span><span style="color:rgb(0, 0, 0)">I made these box &amp; violin plots using Python&rsquo;s Seaborn library in a Jupyter Notebook. 
The code below generates the final chart (split violin plot).</span></span><br /><br /><span><span style="color:rgb(0, 0, 0)">You can also make violin plots in Microsoft Power BI.</span></span></div>  <div><div class="wsite-image wsite-image-border-none " style="padding-top:10px;padding-bottom:10px;margin-left:0;margin-right:0;text-align:center"> <a> <img src="https://www.londonsoda.com/uploads/9/5/6/2/95626400/python-code_orig.png" alt="Picture" style="width:auto;max-width:100%" /> </a> <div style="display:block;font-size:90%"></div> </div></div>   					 				</td>				<td class="wsite-multicol-col" style="width:26.934154424897%; padding:0 15px;"> 					 						  <div><div style="height: 20px; overflow: hidden; width: 100%;"></div> <hr class="styled-hr" style="width:100%;"></hr> <div style="height: 20px; overflow: hidden; width: 100%;"></div></div>   					 				</td>			</tr> 		</tbody> 	</table> </div></div></div>]]></content:encoded></item><item><title><![CDATA[Anscombe's Quartet]]></title><link><![CDATA[https://www.londonsoda.com/articles/anscombes-quartet]]></link><comments><![CDATA[https://www.londonsoda.com/articles/anscombes-quartet#comments]]></comments><pubDate>Wed, 21 Aug 2019 23:00:00 GMT</pubDate><category><![CDATA[Uncategorized]]></category><guid isPermaLink="false">https://www.londonsoda.com/articles/anscombes-quartet</guid><description><![CDATA[ 	 		 			 				 					 						       					 								 					 						  IntroductionThis is a short article about an illustration called Anscombe&rsquo;s Quartet. It is an extreme example of how blind statistical analysis can trick you. It is also another reminder of the importance of visualising data in your EDA (Exploratory Data Analysis).Anscombe's QuartetBelow is a set of four distinct charts, collectively called Anscombe&rsquo;s Quartet. These charts represent four different sets of data, with no ob [...] 
]]></description><content:encoded><![CDATA[<div><div class="wsite-multicol"><div class="wsite-multicol-table-wrap" style="margin:0 -15px;"> 	<table class="wsite-multicol-table"> 		<tbody class="wsite-multicol-tbody"> 			<tr class="wsite-multicol-tr"> 				<td class="wsite-multicol-col" style="width:21.931985050212%; padding:0 15px;"> 					 						  <div><div style="height: 20px; overflow: hidden; width: 100%;"></div> <hr class="styled-hr" style="width:100%;"></hr> <div style="height: 20px; overflow: hidden; width: 100%;"></div></div>   					 				</td>				<td class="wsite-multicol-col" style="width:50.932097983905%; padding:0 15px;"> 					 						  <div class="paragraph"><font size="2"><strong style="">Introduction</strong><br /><br /><span style="color: rgb(0, 0, 0);">This is a short article about an illustration called Anscombe&rsquo;s Quartet. It is an extreme example of how blind statistical analysis can trick you. It is also another reminder of the importance of visualising data in your EDA (Exploratory Data Analysis).<br /><br /><strong>Anscombe's Quartet</strong></span><br /><br /><span style="color: rgb(0, 0, 0);">Below is a set of four distinct charts, collectively called Anscombe&rsquo;s Quartet. These charts represent four different sets of data, with no obvious similarities between them. 
The first clue that there might be </span><span style="color: rgb(0, 0, 0);">some</span><span style="color: rgb(0, 0, 0);"> similarities between them is that they all share similar looking trend lines, which will be explained later on.</span><span style="font-weight: lighter;">&#8203;</span></font></div>  <div><div class="wsite-image wsite-image-border-medium " style="padding-top:0px;padding-bottom:0px;margin-left:0px;margin-right:0px;text-align:left"> <a> <img src="https://www.londonsoda.com/uploads/9/5/6/2/95626400/capture_1_orig.png" alt="Picture" style="width:auto;max-width:100%" /> </a> <div style="display:block;font-size:90%"></div> </div></div>  <div class="paragraph"><ul><li><span><span style="color:rgb(0, 0, 0)"><strong>Chart I)</strong> Evenly distributed data points showing a clear linear trend;</span></span></li><li><span><span style="color:rgb(0, 0, 0)"><strong>Chart II)</strong> Evenly distributed data points showing a clear polynomial trend;</span></span></li><li><span><span style="color:rgb(0, 0, 0)"><strong>Chart III)</strong> A clear linear trend with one outlier that looks like it is skewing the line-of-best-fit; and</span></span></li><li><span><span style="color:rgb(0, 0, 0)"><strong>Chart IV)</strong> Grouped data points along the x-axis with no trend</span></span></li></ul><br /><span><span style="color:rgb(0, 0, 0)">Even from a quick glance, it is obvious that these charts represent diverse datasets and should be understood differently. From a predictive modelling point of view, the charts indicate what sorts of analysis could be used to generate predictions about new points of data, e.g. 
Chart III looks like it could be modelled using linear regression and Chart IV could be a classification problem.</span></span><br /><br /><span><span style="color:rgb(0, 0, 0)">However, the statistical information about the same four charts paints a different and counterintuitive picture:</span></span></div>  <div><div class="wsite-image wsite-image-border-none " style="padding-top:0px;padding-bottom:0px;margin-left:0px;margin-right:0px;text-align:center"> <a> <img src="https://www.londonsoda.com/uploads/9/5/6/2/95626400/capturefigs_orig.png" alt="Picture" style="width:auto;max-width:100%" /> </a> <div style="display:block;font-size:90%"></div> </div></div>  <div class="paragraph"><span style="color:rgb(0, 0, 0)"><br />&#8203;As can be seen in the table above, the Sum, Average and St.dev of the data points for each chart are identical.&nbsp;</span></div>  <blockquote><span><span style="color:rgb(0, 0, 0)">&#8203;<em>So despite the fact that the data looks very diverse when visualised, each data set actually has identical statistical properties.</em></span></span></blockquote>  <div class="paragraph"><span><span style="color:rgb(0, 0, 0)">&#8203;&#8203;</span></span><span><span style="color:rgb(0, 0, 0); font-weight:700">Distribution</span></span><br /><br /><span><span style="color:rgb(0, 0, 0)">This phenomenon crops up &lsquo;in-the-field&rsquo; very regularly, and in many different guises. The most common occurrence is when relying on averages without understanding the distribution of the underlying data, e.g. where two data sets have similar averages but wildly different distributions (in this case, the data can be visualised in a boxplot or violin plot to compare averages and distributions).</span></span><br /><br /><span><span style="color:rgb(0, 0, 0)">A very simple illustration of this would be measuring the average monthly revenue for two businesses, a woolly coat shop and an ice cream shop. 
Both could have the same average monthly revenue over a year but completely different seasonal patterns.&nbsp;</span></span><br /><br /><span><span style="color:rgb(0, 0, 0); font-weight:700">Outliers</span></span><br /><br /><span><span style="color:rgb(0, 0, 0)">Another cause of this phenomenon is outliers. Outliers can skew a data set in a way that is&nbsp;</span></span><span style="color:rgb(0, 0, 0)">hidden in its statistical information but</span><span><span style="color:rgb(0, 0, 0)"> obvious when the data is visualised (see Chart III above).</span></span><br /><br /><span><span style="color:rgb(0, 0, 0)"><a href="https://www.londonsoda.com/articles/dealing-with-outliers" target="_blank">Although outliers could be thought of as a subcategory of distribution problems, outliers are often more subtle than looking for broader distribution patterns and need different tools to find and deal with them. </a></span></span><br /><br /><span style="color:rgb(0, 0, 0)">Using data visualisation, as shown in Chart III, is a solid starting point for checking for hidden outliers. However, when dealing with extremely large data sets, where the outliers might be granular and not show up in a data visualisation, another approach might be to use more complicated statistical techniques such as using z-scores.</span><br /><br /><span><span style="color:rgb(0, 0, 0)"><strong>Summary</strong></span></span><br /><br /><span><span style="color:rgb(0, 0, 0)">Anscombe&rsquo;s Quartet is another quick, useful reminder of why it&rsquo;s important to use visualisations when exploring data and how high-level statistical information can be misleading when the data underlying those statistics is not fully understood. 
Remembering the message of Anscombe&rsquo;s Quartet can help reduce critical errors when analysing and modelling data, which can arise for a number of reasons, including a lack of understanding of the distribution of a dataset and the presence of outliers.</span></span></div>   					 				</td>				<td class="wsite-multicol-col" style="width:27.135916965882%; padding:0 15px;"> 					 						  <div><div style="height: 20px; overflow: hidden; width: 100%;"></div> <hr class="styled-hr" style="width:100%;"></hr> <div style="height: 20px; overflow: hidden; width: 100%;"></div></div>   					 				</td>			</tr> 		</tbody> 	</table> </div></div></div>]]></content:encoded></item><item><title><![CDATA[Dealing with outliers]]></title><link><![CDATA[https://www.londonsoda.com/articles/dealing-with-outliers]]></link><comments><![CDATA[https://www.londonsoda.com/articles/dealing-with-outliers#comments]]></comments><pubDate>Sat, 22 Jun 2019 23:00:00 GMT</pubDate><category><![CDATA[Uncategorized]]></category><guid isPermaLink="false">https://www.londonsoda.com/articles/dealing-with-outliers</guid><description><![CDATA[ 	 		 			 				 					 						       					 								 					 						  IntroductionAt their best, outliers can help understand the scope and limitations of a model. At their worst, they create hidden fundamental flaws in data sets that can skew models and muddy the waters of a model&rsquo;s predictive power.&#8203;The method for dealing with outliers is often boiled down to &lsquo;search and destroy&rsquo;, which can lead to the loss of good data. But what if there was another way of dealing with outliers? [...] 
]]></description><content:encoded><![CDATA[<div><div class="wsite-multicol"><div class="wsite-multicol-table-wrap" style="margin:0 -15px;"> 	<table class="wsite-multicol-table"> 		<tbody class="wsite-multicol-tbody"> 			<tr class="wsite-multicol-tr"> 				<td class="wsite-multicol-col" style="width:21.905859465224%; padding:0 15px;"> 					 						  <div><div style="height: 20px; overflow: hidden; width: 100%;"></div> <hr class="styled-hr" style="width:100%;"></hr> <div style="height: 20px; overflow: hidden; width: 100%;"></div></div>   					 				</td>				<td class="wsite-multicol-col" style="width:50.953567059974%; padding:0 15px;"> 					 						  <div class="paragraph"><font size="3"><span style="color: rgb(0, 0, 0); font-weight: 700;">Introduction</span><br /><br /><span style="color: rgb(0, 0, 0);">At their best, outliers can help you understand the scope and limitations of a model. At their worst, they create hidden fundamental flaws in data sets that can skew models and muddy the waters of a model&rsquo;s predictive power.<br />&#8203;</span><br /><span style="color: rgb(0, 0, 0);">The method for dealing with outliers is often boiled down to &lsquo;search and destroy&rsquo;, which can lead to the loss of good data. But what if there was another way of dealing with outliers? What if you could use outliers to your advantage? 
What are outliers anyway?</span></font></div>  <div><div class="wsite-image wsite-image-border-none " style="padding-top:0px;padding-bottom:0px;margin-left:0px;margin-right:0px;text-align:center"> <a> <img src="https://www.londonsoda.com/uploads/9/5/6/2/95626400/editor/dog-3689594-1920.jpg?1556219710" alt="Picture" style="width:auto;max-width:100%" /> </a> <div style="display:block;font-size:90%"></div> </div></div>  <div class="paragraph"><font size="3"><span><span style="color:rgb(0, 0, 0); font-weight:700">What is an outlier?</span></span><br /><br /><span><span style="color:rgb(0, 0, 0)">Although it is often obvious whether a data point is an outlier, sometimes there are data points that lie at the margins and may or may not be outliers. In these instances, it is important to understand the essence of what outliers are and how they can be defined:</span></span></font></div>  <blockquote><span><span style="color:rgb(0, 0, 0)"><font size="5">An outlier is a data point that falls outside the scope of a model or the description of a group.</font></span></span></blockquote>  <div class="paragraph"><span><span style="color:rgb(0, 0, 0)"><font size="3">For example:</font></span></span><ol><li style="color:rgb(0, 0, 0)"><font size="3">a 7ft tall human is an outlier because they fall outside of our typical description of the group &lsquo;humans&rsquo;, who are almost all between 4.5ft and 6.5ft</font></li><li style="color:rgb(0, 0, 0)"><font size="3">the P/E ratio of Twitter would be an outlier for a model of P/E ratios for companies with revenue under $100m p.a., because Twitter would fall outside the scope of the model</font></li></ol><br /><font size="3"><span><span style="color:rgb(0, 0, 0)">This suggests that there is no such thing as a predetermined outlier.<br />&#8203;</span></span><br /><span><span style="color:rgb(0, 0, 0)">A data point is an outlier depending on i) its relative characteristics compared to a group and ii) the scope of the model that it is included 
in. If you change the group or the scope of the model, a data point could cease to be an outlier.</span></span><br /><br /><span><span style="color:rgb(0, 0, 0)">For example:</span></span></font><ol><li style="color:rgb(0, 0, 0)"><font size="3">a 7ft tall human is not an outlier because they fall inside our typical description of the group &lsquo;mammals&rsquo;</font></li><li style="color:rgb(0, 0, 0)"><font size="3">the P/E ratio of Twitter would not be an outlier for a model of P/E ratios for companies with revenue under $10b</font></li></ol><br /><font size="3"><span><span style="color:rgb(0, 0, 0)">Defining outliers is not usually this semantic; however, it is important to spend time before and during analysis thinking about how the definition of your outliers impacts the scope of the model and the subject matter being analysed or modelled.</span></span><br /><br /><span><span style="color:rgb(0, 0, 0); font-weight:700">How to spot outliers</span></span><br /><br /><span><span style="color:rgb(0, 0, 0)">By far the fastest and easiest way to spot potential outliers in a data set is through&nbsp;</span><span style="color:rgb(0, 0, 0); font-weight:700">visualisation</span><span style="color:rgb(0, 0, 0)">. It is best practice to visualise data in a variety of ways to spot data points that just don&rsquo;t look right. Although it seems a bit unscientific to draw up some charts and assess the data by eye, it is a crucial first step in spotting potential outliers.</span></span><br /><br /><span><span style="color:rgb(0, 0, 0)">&#8203;Visualisation is typically the first thing people do when trying to understand a data set. This makes a lot of sense because human beings are built to&nbsp;</span><span style="color:rgb(0, 0, 0)">see</span><span style="color:rgb(0, 0, 0)">&nbsp;outliers. 
Take a look at the picture below and think how long it took you to spot the outlier!</span></span></font></div>  <div><div class="wsite-image wsite-image-border-none " style="padding-top:10px;padding-bottom:10px;margin-left:0;margin-right:0;text-align:center"> <a> <img src="https://www.londonsoda.com/uploads/9/5/6/2/95626400/jessica-ruscello-120986-unsplash-1_1_orig.jpg" alt="Picture" style="width:auto;max-width:100%" /> </a> <div style="display:block;font-size:90%"></div> </div></div>  <div class="paragraph"><font size="3"><span><span style="color:rgb(0, 0, 0)">When solving real-world problems, there is another highly effective, and sometimes more thorough, method of finding outliers: </span><span style="color:rgb(0, 0, 0); font-weight:700">speaking to people</span><span style="color:rgb(0, 0, 0)">!</span></span><br /><br /><span><span style="color:rgb(0, 0, 0)">In practice, data sets have owners who can reveal additional and hidden characteristics of the data set that cannot be detected through visualisation. For example, when working with a SaaS business on a customer analysis project, I asked the CFO what she would </span><span style="color:rgb(0, 0, 0)">expect</span><span style="color:rgb(0, 0, 0)"> to find in the data set. She told me that certain large and overseas customers have a completely different price plan from normal customers and that these customers would have a significantly lower &pound;/employee ratio. 
Having this conversation meant that these customers could be excluded from the main analysis and modelled separately without skewing the general population.</span></span><br /><br /><span><span style="color:rgb(0, 0, 0)">Without this conversation, this information would not have been picked up by a visualisation and the model would not have taken into account the divergent price plans.</span></span><br /><br /><span><span style="color:rgb(0, 0, 0)">Asking questions like &lsquo;what trends would you expect to find?&rsquo;, &lsquo;do you make any manual adjustments to the data set?&rsquo; and &lsquo;are there any one-offs in this data?&rsquo; can help get a better understanding of a data set and help define outliers.</span></span><br /><br /><span><span style="color:rgb(0, 0, 0); font-weight:700">Dealing with outliers</span></span><br /><br /><span><span style="color:rgb(0, 0, 0)">Once you have defined and identified outliers, the next step is dealing with them. In general, there are three common methods of dealing with outliers:</span></span></font><ol><li style="color:rgb(0, 0, 0)"><font size="3"><span><span>Exclude outliers by specific instances - e.g. excluding shop x from a model of retail park footfall because you know it is closing down;</span></span>&#8203;</font></li><li style="color:rgb(0, 0, 0)"><span><span><font size="3">Exclude outliers by thresholds - e.g. excluding times between 2100 and 0700 when modelling retail park footfall; and</font></span></span></li><li style="color:rgb(0, 0, 0)"><span><span><font size="3">Change the scope of the model or the definition of the group you are trying to model, as outlined above.</font></span></span></li></ol> <font size="3"><span><span style="color:rgb(0, 0, 0)">The main disadvantage of methods 1) and 2) is that you sometimes exclude valuable data. 
For example, you are creating a model for predicting</span></span><br /><br /><span><span style="color:rgb(0, 0, 0)">&#8203;For method 3), this is a subjective process and the pros and cons are contextual. To illustrate this, the chart below sets out some data points with some potential outliers and a linear line of best fit:</span></span></font></div>  <div><div class="wsite-image wsite-image-border-none " style="padding-top:0px;padding-bottom:0px;margin-left:0px;margin-right:0px;text-align:center"> <a> <img src="https://www.londonsoda.com/uploads/9/5/6/2/95626400/graph-1_orig.png" alt="Picture" style="width:auto;max-width:100%" /> </a> <div style="display:block;font-size:90%"></div> </div></div>  <div class="paragraph"><span><span style="color:rgb(0, 0, 0)"><font size="3">The decision is whether to <strong>a)</strong> exclude these outliers and improve the model's accuracy. These data points will be lost and the model&rsquo;s breadth of predicting power will be reduced:</font></span></span></div>  <div><div class="wsite-image wsite-image-border-none " style="padding-top:10px;padding-bottom:10px;margin-left:0;margin-right:0;text-align:center"> <a> <img src="https://www.londonsoda.com/uploads/9/5/6/2/95626400/graph-2_orig.png" alt="Picture" style="width:auto;max-width:100%" /> </a> <div style="display:block;font-size:90%"></div> </div></div>  <div class="paragraph"><span><span style="color:rgb(0, 0, 0)"><font size="3">OR <strong>b)</strong>&nbsp;include the data points and change the scope of the model, which may reduce the accuracy of the model (vs. 
simply deleting the outliers) but will broaden the predictive power of the model:</font></span></span></div>  <div><div class="wsite-image wsite-image-border-none " style="padding-top:10px;padding-bottom:10px;margin-left:0;margin-right:0;text-align:center"> <a> <img src="https://www.londonsoda.com/uploads/9/5/6/2/95626400/graph-3_orig.png" alt="Picture" style="width:auto;max-width:100%" /> </a> <div style="display:block;font-size:90%"></div> </div></div>  <div class="paragraph"><font size="3"><span><span style="color:rgb(0, 0, 0)">To preserve this valuable data, there is a less commonly used fourth method for dealing with outliers: </span><span style="color:rgb(0, 0, 0); font-weight:700">imputation</span><span style="color:rgb(0, 0, 0)">.</span></span><br /><br /><span><span style="color:rgb(0, 0, 0)">Imputation is the method of replacing data with substitute data. Often used to fill in </span><span style="color:rgb(0, 0, 0)">missing</span><span style="color:rgb(0, 0, 0)"> data, imputation can also be used to replace outliers. The benefit of imputation is that valuable data can be kept in the model to improve its accuracy.</span></span></font><br /><br /><font size="3" style="color:rgb(0, 0, 0)">&#8203;With the example of modelling footfall in a retail park, if one of the shops were closed for refurbishment 3 months of the year it would be an outlier. An easy way to prevent this outlier from skewing the data set would be to exclude it. An alternative approach could be to impute data from the same 3 months of the previous year, +/- a growth rate. 
This way you could retain the 9 good months and have a fair substitute for the outlier 3 months.</font><br /><br /><font color="#323232"><font size="4"><strong>Summary</strong></font><br /><br /><font size="3">Outliers are subjective and depend on i) the definition of the group that the data point belongs to and ii) the scope of the model that the data points are being used in.&nbsp;<br /><br />The quickest way of spotting outliers is through visualisation and looking at which data points stand out or if there are any unusual visual patterns in the data. Speaking to people is another effective way of finding outliers, and can even identify hidden outliers that could not be found by analytically or visually interrogating the data.<br /><br />Dealing with outliers can be very straightforward, e.g. simply deleting the outlier, or more complicated, e.g. using imputation to replace the outlier with substitute data.</font></font><br /><br /></div>   					 				</td>				<td class="wsite-multicol-col" style="width:27.140573474802%; padding:0 15px;"> 					 						  <div><div style="height: 20px; overflow: hidden; width: 100%;"></div> <hr class="styled-hr" style="width:100%;"></hr> <div style="height: 20px; overflow: hidden; width: 100%;"></div></div>   					 				</td>			</tr> 		</tbody> 	</table> </div></div></div>]]></content:encoded></item><item><title><![CDATA[Benford's law]]></title><link><![CDATA[https://www.londonsoda.com/articles/benfords-law]]></link><comments><![CDATA[https://www.londonsoda.com/articles/benfords-law#comments]]></comments><pubDate>Thu, 25 Apr 2019 23:00:00 GMT</pubDate><category><![CDATA[Uncategorized]]></category><guid isPermaLink="false">https://www.londonsoda.com/articles/benfords-law</guid><description><![CDATA[ 	 		 			 				 					 						       					 								 					 						  &#8203;IntroductionIf you picked a random country and measured its population, there&rsquo;s roughly a 30.1% chance the first digit of that number is a &lsquo;1&rsquo; 
and a 17.6% chance that it is a &lsquo;2&rsquo;. The distribution of these first digits is known as Benford&rsquo;s Law.Benford&rsquo;s Law, also known as the &lsquo;leading digit rule&rsquo;, appears&nbsp;everywhere&nbsp;in economics, human geography, nature and sport [...] ]]></description><content:encoded><![CDATA[<div><div class="wsite-multicol"><div class="wsite-multicol-table-wrap" style="margin:0 -15px;"> 	<table class="wsite-multicol-table"> 		<tbody class="wsite-multicol-tbody"> 			<tr class="wsite-multicol-tr"> 				<td class="wsite-multicol-col" style="width:21.919247926602%; padding:0 15px;"> 					 						  <div><div style="height: 20px; overflow: hidden; width: 100%;"></div> <hr class="styled-hr" style="width:100%;"></hr> <div style="height: 20px; overflow: hidden; width: 100%;"></div></div>   					 				</td>				<td class="wsite-multicol-col" style="width:50.944831579181%; padding:0 15px;"> 					 						  <div class="paragraph"><font color="#323232"><font size="4"><span style="font-weight:700"><br />&#8203;Introduction</span></font><br /><br /><font size="3">If you picked a random country and measured its population, there&rsquo;s roughly a 30.1% chance the first digit of that number is a &lsquo;1&rsquo; and a 17.6% chance that it is a &lsquo;2&rsquo;. The distribution of these first digits is known as Benford&rsquo;s Law.<br /><br />Benford&rsquo;s Law, also known as the &lsquo;leading digit rule&rsquo;, appears&nbsp;<em>everywhere</em>&nbsp;in economics, human geography, nature and sports, but few people have ever heard of it.<br /><br />Why does this mathematical phenomenon exist? Where does it appear? 
So what?</font></font></div>  <div class="paragraph"><font size="3" color="#323232"><span><span style="font-weight: 700;">What is Benford&rsquo;s Law?</span></span><br /><br /><span>Think of all the numerical data that you have created over the last 7 days: bank transactions, journey times, journey distances, doorways walked through, millilitres of water drunk, the number of words you&rsquo;ve spoken.</span><br /><br /><span>For each of these numbers, Benford&rsquo;s Law says that the likelihood of the first digit being a &lsquo;1&rsquo; is not 11.1% (1/9), but is in fact around 30.1%. Benford&rsquo;s Law sets out a distribution curve of leading digits, in which the probability of a leading digit <em>d</em> is log<sub>10</sub>(1 + 1/<em>d</em>), with &lsquo;1&rsquo; being the most likely and &lsquo;9&rsquo; being the least likely (around 4.6%), as follows:</span></font></div>  <div><div class="wsite-image wsite-image-border-none " style="padding-top:0px;padding-bottom:10px;margin-left:0px;margin-right:0px;text-align:center"> <a> <img src="https://www.londonsoda.com/uploads/9/5/6/2/95626400/published/distribution.png?1556221400" alt="Picture" style="width:477;max-width:100%" /> </a> <div style="display:block;font-size:90%"></div> </div></div>  <div class="paragraph"><font size="3"><span><span style="color:rgb(50, 50, 50)">This distribution has been found to occur in a spooky number of datasets. For example, below is the distribution of leading digits for the populations of 240 countries. 
Although there are small divergences from Benford&rsquo;s Law, the pattern is clear:</span></span></font></div>  <span class='imgPusher' style='float:left;height:0px'></span><span style='display: table;width:auto;position:relative;float:left;max-width:100%;;clear:left;margin-top:0px;*margin-top:0px'><a><img src="https://www.londonsoda.com/uploads/9/5/6/2/95626400/editor/vs-population_2.png" style="margin-top: 0px; margin-bottom: 10px; margin-left: 0px; margin-right: 0px; border-width:0; max-width:100%" alt="Picture" class="galleryImageBorder wsite-image" /></a><span style="display: table-caption; caption-side: bottom; font-size: 90%; margin-top: -10px; margin-bottom: 10px; text-align: center;" class="wsite-caption"></span></span> <div class="paragraph" style="display:block;">&nbsp;</div> <hr style="width:100%;clear:both;visibility:hidden;"></hr>  <div class="paragraph"><span><span style="color:rgb(50, 50, 50)"><font size="3">Here is the same analysis using annual GDP (US$):</font></span></span></div>  <div><div class="wsite-image wsite-image-border-none " style="padding-top:0px;padding-bottom:10px;margin-left:0px;margin-right:0px;text-align:left"> <a> <img src="https://www.londonsoda.com/uploads/9/5/6/2/95626400/published/vs-gdp.png?1556270334" alt="Picture" style="width:auto;max-width:100%" /> </a> <div style="display:block;font-size:90%"></div> </div></div>  <div class="paragraph"><font size="3"><span><span style="color:rgb(50, 50, 50); font-weight:700"><br />When does this rule work?</span></span><br /><br /><span><span style="color:rgb(50, 50, 50)">Fully explaining why this phenomenon occurs is at the </span><span style="color:rgb(50, 50, 50)">difficult difficult lemon difficult end of mathematics.</span></span><br /><br /><span><span style="color:rgb(50, 50, 50)">However, in general the rule will work with most statistical data which spans several orders of magnitude, e.g. 
10<sup>1</sup></span><span style="color:rgb(50, 50, 50)"> to 10<sup>4</sup>.</span></span><br /><br /><span><span style="color:rgb(50, 50, 50)">Counter-intuitively, it&rsquo;s probably easiest to understand the conditions where the distribution </span><span style="color:rgb(50, 50, 50)">won&rsquo;t</span><span style="color:rgb(50, 50, 50)"> be found:</span></span></font><ul><li><span><span style="color:rgb(50, 50, 50)"><span><font size="3"><span style="font-weight:700">Sequential numbers</span><span>: Any set of numbers that forms a consecutive or formulaic pattern will not work, e.g. invoice numbers or dates. This is because these data sets can, and probably will, have arbitrary cut-offs, starting points and end points. There is no reason why a list of invoice numbers can&rsquo;t start at &lsquo;2...&rsquo; or &lsquo;3...&rsquo; or always start with a &lsquo;2002...&rsquo;</span></font></span></span></span></li><li><span><font size="3"><span style="font-weight:700">Max/min conditions</span><span>: If the number set has limits or thresholds, these can skew the distribution. For example, if you asked people to &lsquo;pick a number between 150 and 550&rsquo;, you would not get a Benford&rsquo;s Law distribution</span></font></span></li><li><font size="3"><span><span style="font-weight:700">Artificial clusters:</span><span> Sometimes humans like to make their lives easy and measure things with convenient scales. Take human height as an example. The average UK woman is 162 cm tall, with very few being under 100 cm or over 199 cm and none being over 299 cm, so almost every height recorded in centimetres has a leading digit of &lsquo;1&rsquo;</span></span></font></li><li><span><font size="3"><span style="font-weight:700">Human bias:</span><span> Data created by human decision making can carry inherent biases. For example, humans are influenced by price thresholds when making purchasing decisions, which is why many prices are &pound;x.99. 
This is also true when humans pick &lsquo;random&rsquo; numbers: people disproportionately choose 3 or 7 when picking a number between 1 and 10</span></font></span></li><li><span><font size="3"><span style="font-weight:700">&lsquo;Building block&rsquo; numbers: </span><span>Some recorded numbers are the result of combining other numbers. If you were looking at the distribution of leading digits of bets in a poker competition, the results might not follow a Benford&rsquo;s Law distribution because each bet is built up from fixed chip values. If the smallest chip size is $25, bets of $10, $20 and $30 would be impossible. Another example might be a fast food restaurant with a small menu, where there are relatively few combinations of items that can form a transaction value.</span></font></span></li></ul><br /><font size="3"><span><span style="color:rgb(50, 50, 50); font-weight:700">How can I use Benford&rsquo;s Law?</span></span><br /><br /><span><span style="color:rgb(50, 50, 50)">When it was initially discovered in the 19th century and rediscovered in the 1930s, Benford&rsquo;s Law was filed away under &lsquo;interesting but not useful&rsquo;. In the pre-computer era, data collection and analysis were slow and painful. </span></span><br /><br /><span><span style="color:rgb(50, 50, 50)">Today, it is extremely straightforward to check whether a dataset has a Benford&rsquo;s Law distribution of leading digits. 
I think that Benford&rsquo;s Law is an incredibly useful tool that can quickly help you understand the characteristics of a data set and identify hidden biases.</span></span></font></div>  <span class='imgPusher' style='float:left;height:0px'></span><span style='display: table;width:auto;position:relative;float:left;max-width:100%;;clear:left;margin-top:0px;*margin-top:0px'><a><img src="https://www.londonsoda.com/uploads/9/5/6/2/95626400/published/vs-human.png?1556270376" style="margin-top: 0px; margin-bottom: 10px; margin-left: 0px; margin-right: 0px; border-width:0; max-width:100%" alt="Picture" class="galleryImageBorder wsite-image" /></a><span style="display: table-caption; caption-side: bottom; font-size: 90%; margin-top: -10px; margin-bottom: 10px; text-align: center;" class="wsite-caption"></span></span> <div class="paragraph" style="display:block;"></div> <hr style="width:100%;clear:both;visibility:hidden;"></hr>  <div class="paragraph"><span><span style="color:rgb(50, 50, 50)"><font size="3">The most common uses of Benford&rsquo;s Law in data and analytics are:<br /></font></span></span><ul><li style="color:rgb(50, 50, 50)"><font size="3"><span style="font-weight:700">Fraud:</span>&nbsp;Benford&rsquo;s Law became fashionable again when people realised it could be used in fraud detection. This follows from the&nbsp;<span style="font-weight:700">human bias</span>&nbsp;point made above: humans make strange decisions when choosing &lsquo;random&rsquo; numbers.<br /><br />When I was in forensic accounting, we used Benford&rsquo;s Law to sense-check accounting&nbsp;software entries. 
If large numbers of transactions had been entered &lsquo;randomly&rsquo; by a&nbsp;human, there would be a distinct divergence from a Benford&rsquo;s Law distribution.<br /><br />This is an example of what the leading digit distribution pattern of a human-entered dataset might look like versus Benford&rsquo;s Law:</font></li></ul><ul><li style="color:rgb(50, 50, 50)"><font size="3"><span><span style="font-weight:700">Bias: </span><span>This is the most useful application of Benford&rsquo;s Law. It can help detect hidden biases in a data set, possibly caused by some of the conditions outlined above under which Benford&rsquo;s Law won&rsquo;t work.</span></span><br /><br /><span><span style="color:rgb(50, 50, 50)">This is incredibly useful for data analysis and creating machine learning models because </span></span><span><span style="color:rgb(50, 50, 50)">it helps remove unidentified biases from datasets that can lead to misleading, at best, </span></span><span><span style="color:rgb(50, 50, 50)">or fundamentally flawed models.</span></span><br /><br /><span><span style="color:rgb(50, 50, 50)">For example, suppose you take a dataset of credit card transactions and compare the leading</span></span>&nbsp;<span><span style="color:rgb(50, 50, 50)">digit distribution to Benford&rsquo;s Law. The analysis shows that there is a higher-than-expected proportion of &lsquo;5&rsquo;s and &lsquo;6&rsquo;s as the leading digit of the transactions. On further investigation, you find out that in December 2018 the credit card company ran a promotion for transactions on electronic goods of &pound;500 or above. 
These transactions can then be excluded from the model where appropriate.</span></span></font></li></ul><br /><font size="3"><span><span style="color:rgb(50, 50, 50); font-weight:700">Summary</span></span><br /><br /><span><span style="color:rgb(50, 50, 50)">Benford&rsquo;s Law, or the &lsquo;leading digit rule&rsquo;, is a solid combination of being interesting and useful. It&rsquo;s also much more widely applicable than it is given credit for.&nbsp;</span></span><br /><br /><span><span style="color:rgb(50, 50, 50)">Not only is it a quick, high-level data exploration tool that can help you understand a dataset, it can also be a robust forensic technique that has been used as legal evidence.&nbsp;</span></span><br /><br /><span><span style="color:rgb(50, 50, 50)">Understanding the characteristics of datasets under which Benford&rsquo;s Law won&rsquo;t work is the key to applying and using it in data and analytics.</span></span></font></div>   					 				</td>				<td class="wsite-multicol-col" style="width:27.135920494217%; padding:0 15px;"> 					 						  <div><div style="height: 20px; overflow: hidden; width: 100%;"></div> <hr class="styled-hr" style="width:100%;"></hr> <div style="height: 20px; overflow: hidden; width: 100%;"></div></div>   					 				</td>			</tr> 		</tbody> 	</table> </div></div></div>  <div class="wsite-spacer" style="height:50px;"></div>  <div><div style="height: 20px; overflow: hidden; width: 100%;"></div> <hr class="styled-hr" style="width:100%;"></hr> <div style="height: 20px; overflow: hidden; width: 100%;"></div></div>  <div class="wsite-spacer" style="height:50px;"></div>]]></content:encoded></item></channel></rss>