Therefore, we will start slowly, introducing basic concepts first and gradually raising the level, and with it the power of the methods we apply.
Our series will cover these topics, amongst others:
- Why data science (this post)
- Basic terminology and exploratory analysis
- Applying network science methods
- Combining new unsupervised machine learning methods with classic statistical analyses
- Thoughts on the ethics of data science and machine learning in the public sector
Alright, let’s get started. Suppose our goal is to find interventions which increase household income in some specific region. How would we go about measuring the impact of our intervention and the factors that make it successful? Of course we would first look at the “data”.
What is data? Is the average account balance of a group (say, individuals of a certain age and region) over a 5-year period a datum? Or the monthly expenses and incomes of one individual household? Well, yes and yes. The problem is that these are different aspects of data, which we should disentangle from each other. We thus realize that there are various hierarchies in “data”. One useful concept here is the Data Information Knowledge Wisdom (DIKW) pyramid:
DIKW pyramid with the data-driven decision process when using traditional methods.
(Note that I’ve added some arrows here that show a cyclical process of data-driven decision making: gather data, extract insights from it, and finally make data-driven decisions and act.)
We can see a clear hierarchy of usefulness between the levels. You cannot use raw data (the data at the bottom of the pyramid) to make decisions; think of a very big Excel sheet without any formulas. You need to process it, to go up the pyramid. However, there is a clear dilemma here, which I will call width vs depth. Let me explain:
First, a working definition of width: the number of datapoints for a fixed set of factors. As we go up the pyramid, the width gradually decreases. For example, at the bottom of the pyramid we have one datapoint for each household’s income per month, while at the top we might have averages per group, region or intervention type. We achieve this narrowing by decreasing the number of changing variables that we look at. This concentration, in turn, increases the usability and “value” of each datapoint. After all, the average income of a group tells us more than a single household’s income does. It also allows us to determine whether certain regions have improved or whether some interventions were especially effective.
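To make this concrete, here is a tiny Python sketch of how aggregation narrows width: six household-level datapoints collapse into two regional averages. The regions and incomes are made-up illustrations, not real program data.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical household-level data: (region, monthly income). Six datapoints.
raw = [("north", 700), ("north", 900), ("north", 1100),
       ("south", 500), ("south", 1500), ("south", 2500)]

# Group the incomes by region.
by_region = defaultdict(list)
for region, income in raw:
    by_region[region].append(income)

# Going up the pyramid: width drops from six raw datapoints to two averages.
averages = {region: mean(incomes) for region, incomes in by_region.items()}
print(len(raw), len(averages), averages)
```

Each of the two remaining datapoints is more "valuable" than any single row of `raw`, but four out of every six numbers are now gone.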
Now we can identify depth as this abstract “value” of each datapoint: essentially, the amount of actionable information we can extract from it. This becomes clear when we compare these two datapoints: the average income of a region in a month vs the income of one household in that region in the same month.
But note that in the process of going up the pyramid, we have not merely concentrated the data; we have also lost valuable insights on the way. Even worse, we presumably didn’t even notice. We could, for example, find that a region was well above the poverty line in average income, while in reality there are a few high earners and most households have low incomes. Or individual incomes could be below the poverty line for one half of the year and above it in the other half. Forget averages: even slightly richer measures such as the Gini coefficient would not help us here. We have simply distilled the information too much.
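A quick sketch of the first pitfall, with made-up numbers: two high earners pull the regional mean comfortably above the poverty line even though most households sit below it.

```python
import statistics

poverty_line = 1000  # hypothetical monthly poverty line

# Made-up monthly incomes for ten households in one region:
# eight low earners and two high earners.
incomes = [600, 650, 700, 700, 750, 800, 800, 850, 5000, 6000]

mean_income = statistics.mean(incomes)      # 1685: looks well above the line
median_income = statistics.median(incomes)  # 775: the typical household is poor
share_below = sum(i < poverty_line for i in incomes) / len(incomes)  # 0.8

print(mean_income, median_income, share_below)
```

A decision maker who only sees the mean would conclude the region is fine, while 8 of 10 households are under the line.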
The key take-away is that we need to learn as much as possible from the raw, not-easy-to-look-at data. We need to keep the same depth (insights) but have more width (more perspective). Luckily, new methods from data science come to the rescue.
Data + Science = Data Science
The idea of data science: extract as many useful insights as possible by combining loads of data intelligently. In the picture below you can see how the pyramid keeps the same height (i.e. depth) but becomes much wider. This allows us to extract insights from differing perspectives, giving a more holistic picture.
DIKW pyramid with the data-driven decision process when using data science methods.
The science part of “data science” is usually defined as a combination of computer science, statistics and domain knowledge. This is typically depicted in a chart like this (this one comes from Berkeley):
Data science combines multiple disciplines toward one objective: same depth, more width. It is thus the intersection of proper domain knowledge, statistics and coding skills.
One way we could use data science is to understand under which circumstances interventions for raising household incomes work:
- Doing exploratory analysis of the lowest-level data (see next post).
- Analyzing outliers: this would correspond to mapping the individuals who spent the most months under the poverty line to find common properties.
- Finding individuals with similar properties who did not dip below the poverty line, and using this data to find under which circumstances the interventions worked and when they didn’t.
- Improving the policy and tracking relevant data to ensure alignment with the program objective.
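The outlier and matching steps above can be sketched as follows, again on made-up data; the household IDs, regions and incomes are purely illustrative assumptions.

```python
poverty_line = 1000  # hypothetical monthly poverty line

# household id -> (region, twelve monthly incomes); all values are made up
households = {
    "A": ("north", [900, 950, 800, 700, 900, 950,
                    900, 850, 800, 900, 950, 900]),
    "B": ("north", [1100, 1200, 1150, 1300, 1250, 1100,
                    1200, 1150, 1100, 1250, 1300, 1200]),
    "C": ("south", [1500, 400, 1600, 500, 1500, 450,
                    1550, 500, 1500, 400, 1600, 500]),
}

def months_below(incomes, line):
    """Count how many months a household fell under the poverty line."""
    return sum(m < line for m in incomes)

# Outlier analysis: households below the line in most months (> 6 of 12).
at_risk = {hid for hid, (_, incomes) in households.items()
           if months_below(incomes, poverty_line) > 6}

# Matching: households in the same regions that never dipped below the line.
regions_of_interest = {households[hid][0] for hid in at_risk}
peers = {hid for hid, (region, incomes) in households.items()
         if region in regions_of_interest
         and months_below(incomes, poverty_line) == 0}

print(at_risk, peers)  # household A is flagged; B is a comparable peer
```

Note that household C, which spends exactly half the year below the line, slips through this simple threshold: a real analysis would need richer criteria than a single cutoff, which is exactly the kind of nuance later posts will dig into.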
We will get into the nitty-gritty of how to do these things in the next posts.
We need data science to make the best use of the available data. For this, we need to have datapoints that carry insights and give multiple perspectives — datapoints that are deep and wide.
The public sector has a responsibility not to miss relevant effects on specific groups, as these can be hidden in averages and aggregates. Applying proper data science methods thus enables decision makers to have a direct positive impact on human lives. It can make the difference between a family stuck in poverty because of inadequate analysis and false conclusions, and an evidence-based policy design that fosters prospering communities.