This is the second post (look here for the first one) in a series where we look at what data science can do for decision makers, analysts and program managers in the public and social sectors. Although this specific analysis will focus on the political preferences of a certain population, the methodology is applicable much more broadly. It could also inform what types of food displaced people prefer to have when given a choice or how a community with different groups could be best educated about an upcoming vaccination initiative.
Today we will start with some actual data science and cover the basic terminology and some exploratory analysis tools. Looking back at the DIKW action cycle we covered in the last post we will focus on the first step, i.e. getting to know the data. The goal of the exploratory analysis we will be looking at is to generate hypotheses that can be analyzed and tested in the next steps. For that we will look at the 2017 German federal election, first explaining the data and then looking at its properties and visualizations. By applying exploratory analytical tools we are able to arrive at some thought-provoking hypotheses that invite further explorations.
Our data and our challenge
For this post and the next few posts, we will, as a case study, be trying to analyze and understand the public in a political context. More specifically, we will distill the determinants of social media success of German parties in the 2017 election.
While many other datasets are available, we’ve picked a dataset of German-speaking Twitter users for three main reasons:
- It has huge potential for extracting insights. This is because it is very fine-grained: our individual data points are single users or single tweets.
- It necessitates the use of data science. Old-school methods will not give us many insights on a dataset with more than 9 million rows.
- It is a good example of big data. While Twitter is publicly available, organizations have many similarly-sized, proprietary datasets which contain many insights. Thus, when reading this analysis, try to imagine what would be the equivalent when applied to your dataset and your challenges.
The data. We will focus on the followership structure of users. That means that for each of the more than 9 million users, we will have information on who they are following and also how many followers they have. In particular, we will be interested in which people follow which political parties and how they are interconnected. In short, we are trying to understand the networks, their effects and individuals’ roles in them.
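To make the shape of this data concrete, here is a minimal sketch in Python. The user and account names are invented for illustration, and the follower count is computed only from the toy table itself, whereas a real dataset would typically carry the follower count as its own field.

```python
# Toy follow data: user -> set of accounts that user follows.
# All names here are hypothetical placeholders, not real accounts.
follows = {
    "anna": {"ben", "party_a"},
    "ben": {"anna"},
    "cem": {"anna", "ben", "party_a", "party_b"},
}

def count_follows_and_followers(follows):
    """Return {user: (number of follows, number of followers)}."""
    followers = {user: 0 for user in follows}
    for user, followed in follows.items():
        for account in followed:
            # Only count followers for accounts that are users in our table.
            if account in followers:
                followers[account] += 1
    return {user: (len(follows[user]), followers[user]) for user in follows}

counts = count_follows_and_followers(follows)
for user, (n_follows, n_followers) in counts.items():
    print(user, "follows", n_follows, "and is followed by", n_followers)
```

Storing a set of followed accounts per user is just a compact way to write down the same information as a very wide table with one column per account.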
Features, distributions, averages.
When we look at data, we often look at the things that change between individual data points. If you imagine the Twitter dataset as a giant Excel spreadsheet with many rows, then each row represents a user. The columns hold the number of accounts that follow that user and, with one column per account, the accounts that the user follows: a “1” in a column means that the user follows the account associated with that column. We are interested in the structures that are present in these columns. To signify that these are the interesting factors in our analysis, these properties are called features. To explore the properties of the data we will first look at the main features, then dive into how party accounts were affected, and lastly put everything together and look at the overall network of social media activity. Let’s first look at the distribution of two main features: the number of follows and the number of followers:
Cumulative distribution of the number of follows and followers for our dataset.
We can immediately see that our dataset contains few users with a very high number of follows (given in blue). These users follow far more than 1000 users and could potentially be non-natural users, i.e. bot accounts. We have even fewer users in our dataset with an extremely high number of followers. This is because of the way our data was collected: we focus on the users, not on popular “sites” like news outlets’ accounts. We can also see that the mean number of follows is high (due to the bots), but its median is in fact a lot lower, at 164. Because of potentially extreme and uninformative outliers, it’s much better to use the median in this case.
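The effect of outliers on the mean is easy to reproduce. The sketch below uses made-up follow counts (not values from our dataset), with one bot-like account following 50,000 users, and also builds the kind of cumulative distribution shown in the figure.

```python
from statistics import mean, median

# Made-up follow counts: most users follow a few hundred accounts,
# one bot-like account follows 50,000.
follow_counts = [80, 120, 150, 160, 170, 200, 310, 50_000]

def ecdf(values):
    """Return sorted (value, share of users with at most that value) pairs."""
    ordered = sorted(values)
    n = len(ordered)
    return [(value, (i + 1) / n) for i, value in enumerate(ordered)]

mean_follows = mean(follow_counts)
median_follows = median(follow_counts)
cumulative = ecdf(follow_counts)

print("mean:  ", mean_follows)
print("median:", median_follows)
for value, share in cumulative:
    print(f"{share:.0%} of users follow at most {value} accounts")
```

A single extreme value pulls the mean far above what a typical user looks like, while the median barely moves.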
There is a lot more to find even on this highly aggregated level, but we’ll dive into more tangible stuff now.
We look at how the number of followers has changed with time for each party account, as calculated from our data. We do this both for before and after the election, and the results are given in Table 1. [As a precautionary note, some adjustment and careful interpretation are necessary, as Twitter is not representative of the population.]
Note, if you’re unfamiliar with the German parties, CDU are the conservative, SPD the social democratic, Grüne the green, FDP the liberal, Linke the left-wing and AfD the new right-wing party respectively. For more details see here.
What we find from Table 1:
- Unequivocal decreases in general followers of the CDU, Grüne and SPD
- Highest increase in AfD followership but also a bit in FDP
It is also interesting to look at the “hardliners” – the users that only follow one party’s account, but none of the others. The changes for these are in Table 2.
What we find in Table 2:
- Highest decrease in hardliners for SPD and CDU
- Highest increase in hardliners for FDP and AfD
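For readers who want to see how such a hardliner count could be computed, here is a minimal sketch. It assumes each snapshot is a mapping from user to the set of accounts they follow, and it uses the party names as stand-ins for the actual party account handles. The users and follow sets are invented.

```python
# Party names used as placeholders for the real party account handles (assumption).
PARTIES = {"CDU", "SPD", "Grüne", "FDP", "Linke", "AfD"}

def hardliners(follows):
    """Map each party to the set of users who follow that party and no other party."""
    result = {party: set() for party in PARTIES}
    for user, followed in follows.items():
        followed_parties = followed & PARTIES
        if len(followed_parties) == 1:
            (party,) = followed_parties
            result[party].add(user)
    return result

def hardliner_change(before, after):
    """Change in the number of hardliners per party between two snapshots."""
    h_before, h_after = hardliners(before), hardliners(after)
    return {party: len(h_after[party]) - len(h_before[party]) for party in PARTIES}

# Invented toy snapshots.
before = {"u1": {"SPD", "news"}, "u2": {"SPD", "CDU"}, "u3": {"AfD"}}
after = {
    "u1": {"SPD", "AfD", "news"},  # u1 now follows a second party
    "u2": {"SPD", "CDU"},
    "u3": {"AfD"},
    "u4": {"FDP"},                 # new user following a single party
}

change = hardliner_change(before, after)
print(change)
```

In the toy data, u1 stops being an SPD hardliner the moment they also follow the AfD, which is the kind of switch the tables can hint at but not prove.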
If we combine both tables, we get a clearer picture:
- Linke and Grüne gained a bit in hardliners but lost in general followership. This might be interpreted as having captured the more extreme users but not the “average voters”
- The hardliners lost by CDU and SPD have likely ended up in the increased general followership of the FDP and, mostly, the AfD. This would mean that former hardliner followers of SPD and CDU have now switched to also following the AfD
Further things one could explore include:
- Is there a connection between the number of accounts a user follows and the likelihood of following a party?
- Do the absolute values reflect the election outcome when normalised for party share and Twitter demographics?
- Which accounts stick out as most relevant for followers of specific parties?
Visualization with networks
Now we will use the same information that we have used before but put it in a different context: we look at how the users are connected with each other. For this, we will use network visualization as a tool that can distill structures in the millions of relationships between the individual users.
Below you will see two snapshots of the network of Twitter users that follow each other. Each dot is a user, each edge shows a follower relationship, and we color the users by which party they follow.
Top: before the election. Bottom: after the election. Only hardliners are shown; users that follow multiple parties are blacked out.
What we find from the pre-election snapshot:
- All users colored by party followership form clearly separated clusters. This is especially intriguing, as the party-labels were not used for generating the layout. This suggests hidden structure in the dataset
- The AfD cluster is further away than the rest, indicating a very different internal structure and fewer links to the others
- The Grüne are very central, in between multiple other parties
What we see after the election:
- AfD has one giant cluster and occupies a very distinct corner in the network: its position as an outsider remains, and the internally well-connected “ball” has become even stronger
- Other parties still occupy certain areas but these have become less distinct
(Some) Hypotheses we can generate from this:
- The new right-wing party has a very distinct structure differing from the mainstream parties
- While after the election, all other parties seem well mixed, the right-wing cluster remains remote and non-mixed. This might have the effect of creating an echo chamber, where information does not flow well into the cluster from the outside
- It might also lead to a feedback loop, in which people from the mainstream get information from all political perspectives (and thus have to struggle to find consistency), while the AfD users only receive information from within their cluster, further strengthening their message and convincing close and undecided users
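One simple way to start testing the echo-chamber hypothesis is to measure, for each party, what share of its hardliners’ links stay within the party. This is not the method used to produce the snapshots above, just an illustrative statistic. The users, labels and edges below are invented.

```python
def internal_edge_share(edges, party_of):
    """For each party, the share of links touching its users that stay inside the party."""
    internal, total = {}, {}
    for a, b in edges:
        party_a, party_b = party_of.get(a), party_of.get(b)
        if party_a is None or party_b is None:
            continue  # skip users without a single party label
        for party in {party_a, party_b}:
            total[party] = total.get(party, 0) + 1
        if party_a == party_b:
            internal[party_a] = internal.get(party_a, 0) + 1
    return {party: internal.get(party, 0) / total[party] for party in total}

# Invented toy network: a tightly knit group of three and a mixed remainder.
party_of = {"a1": "AfD", "a2": "AfD", "a3": "AfD",
            "s1": "SPD", "c1": "CDU", "g1": "Grüne"}
edges = [("a1", "a2"), ("a2", "a3"), ("a1", "a3"),  # links inside one group
         ("a3", "s1"),                              # a single link to the outside
         ("s1", "c1"), ("c1", "g1"), ("s1", "g1")]  # links across the other parties

shares = internal_edge_share(edges, party_of)
print(shares)
```

A share close to 1 means that almost all connections stay inside the group, which is what an echo chamber would look like in the data; a share close to 0 points to a well-mixed group.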
With fine-grained data and exploratory analyses using network science, we can come up with interesting hypotheses which are impossible to get when relying on “old-school” techniques and software. The power of running complex analysis on large datasets quite easily makes the difference. Thus, we have now covered the first level of the DIKW pyramid and understand the data while already having generated some insights that can now be tested further. You can look forward to that in our next blog post.