
Tracking COVID-19 - what data can tell us (and what it can't)

April 5, 2020 / 5 min read
Matteo Barbieri, Data Scientist

Data science is more than just plotting curves and computing p-values: extracting valuable knowledge and presenting it in a clear way should ultimately be the goal of any study.

In this post I will walk you through my journey of performing some basic data visualization and analysis on the available COVID-19 data using simple tools. I will focus on the process I followed, discussing what I wanted to understand (i.e., which questions I was trying to answer) and how, as well as how to present the results with tables and plots, all while being mindful of the limitations of my approach.

It was a Sunday morning, the 8th of March 2020. Up to that moment, I’ll admit that my level of attention to the whole COVID-19 situation had been relatively low. Sure, I was keeping an eye on the daily reports about Italy (I’m Italian and my family lives there), but I hadn’t yet put any real effort into analysis. Something about the numbers I was reading had been bugging me for a couple of days, but I couldn’t quite figure out what it was. So I decided to invest a couple of hours in creating a few basic plots and seeing what I could make of that information. Spoiler: a few hours later I was sending a four-page document to a few selected friends and family members (it was a delicate moment and I didn’t want to inadvertently spread panic) to warn them of how the situation would very likely evolve in the next few days. And it wasn’t pretty. Later that night three entire regions of Italy were put in lockdown, and the measure was extended to the whole country the following day.

But let’s start from the beginning.

Setup

The very first thing was of course getting access to data, which had to be:

  • Reliable
  • Updated regularly (daily)
  • Easily accessible

A quick search on Google led me to a GitHub repository with enough stars to appear legitimate; you can find it at https://github.com/CSSEGISandData/COVID-19 [1]. I cloned it, created a virtual environment with the standard packages for this kind of analysis (matplotlib/seaborn, pandas, numpy, etc.) and got to work.

Exploratory analysis

I created a new notebook, imported the required libraries and created variables to store folder paths.

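The originally linked snippet is no longer available, so here is a minimal sketch of what this setup might look like; the daily-reports folder path is an assumption based on the repository layout at the time.

```python
import glob
import os
from datetime import datetime

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Path to the daily reports inside the cloned repository
# (assumed layout; adjust if the repository structure has changed).
REPO_DIR = "COVID-19"
DAILY_REPORTS_DIR = os.path.join(
    REPO_DIR, "csse_covid_19_data", "csse_covid_19_daily_reports"
)
```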

The next step was loading the actual data. Using glob I collected the full paths of all relevant files in the folder. The function can also filter files based on their paths using shell-style wildcard patterns, but nothing sophisticated was needed here.

The names of the files are just the date on which the data was collected, in the “MM-DD-YYYY” format. It was then pretty straightforward to write a simple helper function extract_date to parse the string using datetime.strptime. I then loaded the data from all files into a single pandas DataFrame.

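Again, a sketch rather than the original snippet: extract_date parses the date from the file name, and each file’s rows are tagged with that date before everything is concatenated (attaching the date at load time makes the later per-country preprocessing easier).

```python
def extract_date(path):
    """Parse the report date from a file name like '03-08-2020.csv'."""
    filename = os.path.splitext(os.path.basename(path))[0]
    return datetime.strptime(filename, "%m-%d-%Y")

# Collect the full paths of all daily report files.
report_paths = sorted(glob.glob(os.path.join(DAILY_REPORTS_DIR, "*.csv")))

# Load every file, tag its rows with the collection date, and
# concatenate everything into a single DataFrame.
frames = []
for path in report_paths:
    df = pd.read_csv(path)
    df["date"] = extract_date(path)
    frames.append(df)

data = pd.concat(frames, ignore_index=True)
```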

At this point I had all the raw materials (the data) and the environment had been properly set up; it was time to begin the actual analysis.

I started all this because I wanted to see how the disease was progressing in my country, Italy, so I filtered the data, retaining only the statistics relative to Italy, and with a little preprocessing added a column for the date and another for the daily increase in confirmed cases.

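A sketch of that preprocessing step, assuming the column names used before the format change mentioned in [1] (Country/Region, Confirmed, Deaths, Recovered). Since the date was already attached at load time in the sketch above, only the daily-increase column needs to be computed here.

```python
# Keep only the rows for Italy (column names as in the pre-change format).
italy = data[data["Country/Region"] == "Italy"].copy()

# One row per day: aggregate any regional rows, then sort chronologically.
italy = (
    italy.groupby("date")[["Confirmed", "Deaths", "Recovered"]]
    .sum()
    .sort_index()
    .reset_index()
)

# Daily increase of confirmed cases: difference with the previous day.
italy["New cases"] = italy["Confirmed"].diff().fillna(0)
```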

I proceeded by plotting on the same chart the number of total confirmed cases, deaths and people recovered from the disease. I also added the number of new cases, still on the same plot but using a secondary axis in order not to squash that curve (the number of daily new cases is for obvious reasons much lower than the number of total cases).
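In matplotlib this kind of dual-axis plot can be built with twinx; a minimal sketch:

```python
fig, ax = plt.subplots(figsize=(10, 6))

# Cumulative curves on the primary axis.
ax.plot(italy["date"], italy["Confirmed"], label="Confirmed")
ax.plot(italy["date"], italy["Deaths"], label="Deaths")
ax.plot(italy["date"], italy["Recovered"], label="Recovered")
ax.set_ylabel("Total cases")
ax.legend(loc="upper left")

# Daily new cases on a secondary y axis, so the (much lower) curve
# is not squashed against the bottom of the plot.
ax2 = ax.twinx()
ax2.plot(italy["date"], italy["New cases"], color="gray",
         linestyle="--", label="New cases")
ax2.set_ylabel("New cases per day")
ax2.legend(loc="upper right")

fig.autofmt_xdate()
plt.show()
```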

When I saw the plot, I immediately recognized the trend: exponential growth. That’s what had been bugging me: I could see the numbers rising in the news but hadn’t realized the specific nature of the trend.

Fitting a model

The exponential growth of the number of people affected by the disease was clearly bad news, for one reason in particular: between 40% and 50% of infected people need hospitalization, and 10% (of the total) [2] require specific life-saving treatments such as assisted ventilation. The capacity of the national health system (NHS) for these cases in particular is clearly limited, and it was not designed for extreme scenarios like this one.

The next question was then: how long will the NHS hold, given the current growth? To answer it, I fit a simple model using linear regression on the logarithm of the values against the number of days since the beginning of the outbreak (if the trend is exponential, the logarithm of the trend is linear). However, it is necessary to take into account that fitting log-transformed data with a linear regressor tends to underestimate the errors for larger values; one therefore needs a slightly modified version of linear regression that uses the original (untransformed) values to weight the errors of individual samples. In the plots below the difference between the two versions should be evident.
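A sketch of both fits, using numpy’s polyfit, which accepts per-sample weights: weighting each sample by its original value compensates for the fact that taking logs shrinks the residuals on large values.

```python
# Days since the first report, and cumulative confirmed cases.
x = np.arange(len(italy))
y = italy["Confirmed"].to_numpy()

# log is undefined for zero counts, so keep strictly positive values.
mask = y > 0
x, y = x[mask], y[mask]

# Plain fit on log-transformed values: all samples count equally,
# which underweights errors on the largest (most recent) values.
slope_plain, intercept_plain = np.polyfit(x, np.log(y), 1)

# Weighted fit: weight each sample by its original value, so errors
# on large values keep their original importance after the transform.
slope_w, intercept_w = np.polyfit(x, np.log(y), 1, w=y)

def predict(x, slope, intercept):
    """Back-transform the linear fit to the original (case-count) scale."""
    return np.exp(slope * x + intercept)
```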

Now, considering that the model is very limited in its power, since it doesn’t take into account the real complexity of the situation, I expected it to be reliable for predicting the next couple of days at most. I used it anyway (knowing that it would be a very rough estimate) to predict at which point the number of people infected would be so large that the NHS would be overwhelmed. Assuming a 10% rate of people needing treatment in intensive care units and an estimated 5,000 such units available in the whole country, that translates to the question “when will we hit 50,000 cases?” The answer is easily found in the plot below.
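Inverting the fitted exponential gives the projected day directly; a small sketch using the weighted fit from above:

```python
# ~5,000 ICU beds and ~10% of cases needing intensive care means the
# system saturates at roughly 5,000 / 0.10 = 50,000 total cases.
ICU_BEDS = 5_000
ICU_RATE = 0.10
threshold = ICU_BEDS / ICU_RATE  # 50,000 cases

# Solve exp(slope * x + intercept) = threshold for x.
saturation_day = (np.log(threshold) - intercept_w) / slope_w
print(f"Projected to hit {threshold:,.0f} cases around day {saturation_day:.0f}")
```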

Luckily this catastrophic scenario did not become reality (at least not yet; as I’m writing, the situation is critical in many regions of Italy), since right after the initial outbreak measures were taken to at least partially contain the spread of the disease in the hardest-hit regions.

The rest of the world

One characteristic of this outbreak, which in my opinion is the main reason governments waited so long before taking any kind of serious containment measure, is that such measures seem a disproportionate response to a problem that, in its early stages, appears to be of limited size and thus not to pose a serious threat. However, if we adjust the dates so that the numbers refer to roughly the same “period” (the x axis refers to “days from the start of the outbreak”), it becomes pretty obvious that countries with a relatively limited number of cases but following the same trend have many reasons to worry.

In the plot below I adjusted the dates of Italy, Sweden and Norway in order to show the similarities between the trends in those and other European countries. The situation in Spain, France and Germany was essentially identical to the one in Italy just nine days earlier.
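A sketch of how such an alignment can be done: each country’s cumulative curve is plotted against a day index plus a per-country offset. The offsets below are illustrative placeholders (in the article they were chosen so the curves overlap, e.g. shifting Italy by the nine days mentioned above), not values from the original analysis.

```python
def confirmed_by_day(data, country):
    """Cumulative confirmed cases for a country, one value per day."""
    rows = data[data["Country/Region"] == country]
    return rows.groupby("date")["Confirmed"].sum().sort_index().to_numpy()

# Illustrative offsets (in days); tune them so that the curves overlap.
offsets = {"Italy": 9, "Spain": 0, "France": 0,
           "Germany": 0, "Sweden": 0, "Norway": 0}

fig, ax = plt.subplots(figsize=(10, 6))
for country, offset in offsets.items():
    series = confirmed_by_day(data, country)
    ax.plot(np.arange(len(series)) + offset, series, label=country)

ax.set_xlabel("Days from the start of the outbreak (shifted)")
ax.set_ylabel("Confirmed cases")
ax.set_yscale("log")  # exponential trends show up as straight lines
ax.legend()
plt.show()
```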

Take home message

The goal of any study, be it a large set of experiments that takes months to complete or a quick exploratory run on a small dataset, is to shed light on some phenomenon. These are the principles that I follow whenever I perform any kind of data analysis task:

  • Focus on questions. What are you trying to understand? What do you think is going on, what’s the thesis you’re trying to prove (or refute)? Plan your analysis ahead and don’t try to adjust it to the results you find. This does not mean that you cannot change the course of the analysis if you genuinely think it’s a good idea; just be honest when/if the results are not the ones you were expecting (see the next point).
  • Uncertainty is part of the process. Results are often inconclusive. There’s no shame in saying “I don’t know”, and it’s definitely better than ending up with wild claims unsupported by data (or, even worse, claims forged by adopting poor or outright fraudulent practices such as p-hacking).
  • Present your results with plots and tables (and make them clear). Ideally, a plot should be pretty self-explanatory if the context is known. This means using proper labels on the axes and a sensible choice of colors. Plots must make things clearer, not the other way around: if something you produce looks like anything taken from this page, it violates this very basic rule. The same goes for tables; also try to keep them light in terms of visual impact: this video is a nice example.

References

  1. At the moment of writing this article, a change had been made to the format of the files that requires some small modifications to the loading function in order to access the most recent data; data older than 23/3/2020 should still be readable as shown. I’ll try to keep the code up to date.
  2. These estimates were valid when the analysis was first done, on the 8th of March; the situation has since evolved and they have changed.

      Matteo Barbieri is a data scientist at Peltarion, as a member of the applied AI team. He has a strong interest in using AI to create valuable solutions for interesting use cases. He previously worked in the field of computer vision and completed a PhD in Computer Science at the University of Genoa, researching the use of machine learning techniques in the analysis of biomolecular data.