There are currently many dashboards and statistics about the spread of the Coronavirus available all over the internet. With so much information and expert opinion, and with different nations adopting different strategies, from complete lockdown to social distancing to herd immunity, one is left wondering what the right strategy is. Is there any basis to these opinions and advice? This blog is an attempt at data modelling and analysing the Coronavirus (COVID-19) spread with the help of data science and data analytics in Python code. The analysis will help us find the basis behind common notions about the virus spread purely from a dataset perspective. So, let’s flex some data science muscles and jump right into it.

Data Modelling & Analysing Coronavirus: Getting the Dataset

There are a lot of official and unofficial data sources on the web providing COVID-19 related data. One of the most widely used datasets today is the one provided by the Johns Hopkins University’s Center for Systems Science and Engineering (JHU CSSE). Here is the GitHub link for the same: Novel Coronavirus (COVID-19) Cases, provided by JHU CSSE

I have used the time series + consolidated data for all the analysis in this blog. Direct Link

The data is split across the following four files:

  1. time_series_19-covid-confirmed_global.csv
  2. time_series_19-covid-deaths_global.csv
  3. time_series_19-covid-recovered_global.csv **
  4. cases_country.csv

**JHU will stop updating the recovered cases soon (as per their GitHub post)

Data Scientists, Epidemiologists and Researchers all over the world are doing some excellent work to analyze the COVID-19 data too. At places in this blog, you will find references to such works (along with sources). I encourage you to visit them too. 

Inspired by this analysis and want to learn how to do it / wish to replicate this for your project? We can help you there. Just leave your email address in this google form and we will share the analysis with you within 48 hours.

Importing & Understanding the Dataset

Using pandas, the dataset can be imported directly into data-frames. It is much better to read from the URLs (specified in the section above) than to download the files manually, because the analysis can then be reloaded and refreshed with new data easily.
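A minimal sketch of this loading step could look like the following. The helper name load_frames, the data-frame names and the placeholder URLs are assumptions for illustration; substitute the raw CSV links from the JHU CSSE repository. Since pandas.read_csv accepts a URL, a local path or a file-like object, the demonstration uses a tiny in-memory CSV so that it runs offline.

```python
import io

import pandas as pd

# Hypothetical placeholders: replace these with the raw CSV links
# from the JHU CSSE repository before running against live data.
SOURCES = {
    "confirmed_df": "https://example.org/confirmed_global.csv",
    "deaths_df": "https://example.org/deaths_global.csv",
    "recovered_df": "https://example.org/recovered_global.csv",
    "cases_country_df": "https://example.org/cases_country.csv",
}


def load_frames(sources):
    """Read every source (URL, path or file-like object) into a data-frame."""
    return {name: pd.read_csv(src) for name, src in sources.items()}


# Offline demonstration with toy values (not real counts).
# The column layout shown here is assumed, not taken from the post.
sample = io.StringIO(
    "Province/State,Country/Region,Lat,Long,1/22/20,1/23/20\n"
    ",India,21.0,78.0,0,0\n"
)
frames = load_frames({"confirmed_df": sample})
print(frames["confirmed_df"].columns.tolist())
```

Calling load_frames(SOURCES) again at any time re-reads the sources, which is what makes refreshing the analysis cheap.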


In the analysis, I will be using the above-mentioned data-frames from time to time to subset/filter the data for our use. 

Here is a view of the columns in the confirmed_cases data frame.


The columns are the same in the first three data frames. All three contain time-series data for 177 countries, i.e. the Confirmed, Deaths and Recovered case counts (till that date).

The country cases data-frame is not a time series but aggregated data with an additional feature, “Active”.


First, I will do some exploratory analysis on the data and summarize some stats and plot some trends in the existing data. Then I will model the data on the SIR epidemic model and try to predict the count of cases in the upcoming days. 

Data Modelling & Analysing Coronavirus: Exploratory Analysis 

Let’s have a look at the situation so far and where we stand globally today. For this, you can use the “sum()” function on the cases_country_df. 
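As a rough sketch of that summary step, the count columns can be selected and summed in one line. The column names below are assumed from the features named in this post, and the frame is a toy stand-in with made-up numbers rather than the real cases_country_df.

```python
import pandas as pd

# Toy stand-in for cases_country_df (made-up numbers, not real counts).
cases_country_df = pd.DataFrame(
    {
        "Country_Region": ["A", "B", "C"],
        "Confirmed": [100, 50, 10],
        "Deaths": [5, 2, 0],
        "Recovered": [40, 10, 1],
        "Active": [55, 38, 9],
    }
)

# Sum only the numeric count columns; summing the whole frame would
# also try to concatenate the country names.
count_columns = ["Confirmed", "Deaths", "Recovered", "Active"]
global_summary = cases_country_df[count_columns].sum()
print(global_summary)
```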

Global Summary of Cases by Confirmed, Deaths, Recovered, Active
Case Count Summary Globally as on 28th March 2020

Here is the first reference to an external source which shows the spread on a geo-map. You can try and code something similar to this. 

Clusters of Confirmed Coronavirus Cases worldwide (Source: JHU CSSE COVID-19 dashboard)

And, to see how this spread has progressed over time, I plotted the confirmed cases using plotly.graph_objects.
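One way to prepare and draw such a curve is sketched below. It assumes the time-series frame has four identifier columns followed by one column per date in m/d/yy form; the helper names global_series and plot_confirmed are hypothetical. plotly is imported inside the plotting function so that the aggregation helper can be used on its own.

```python
import pandas as pd

ID_COLUMNS = ("Province/State", "Country/Region", "Lat", "Long")  # assumed layout


def global_series(time_series_df, id_columns=ID_COLUMNS):
    """Sum every date column across all rows and index the result by date."""
    totals = time_series_df.drop(columns=list(id_columns)).sum()
    totals.index = pd.to_datetime(totals.index, format="%m/%d/%y")
    return totals


def plot_confirmed(totals):
    """Draw the cumulative curve with plotly.graph_objects."""
    import plotly.graph_objects as go  # imported lazily; only needed for drawing

    fig = go.Figure(
        go.Scatter(x=totals.index, y=totals.values, mode="lines+markers", name="Confirmed")
    )
    fig.update_layout(
        title="Total Confirmed Coronavirus Cases (Globally)",
        xaxis_title="Date",
        yaxis_title="Confirmed cases",
    )
    return fig


# Toy frame with made-up values, only to show the aggregation.
toy = pd.DataFrame(
    {
        "Province/State": [None, None],
        "Country/Region": ["A", "B"],
        "Lat": [0.0, 1.0],
        "Long": [0.0, 1.0],
        "1/22/20": [1, 0],
        "1/23/20": [3, 2],
    }
)
totals = global_series(toy)
print(totals)
```

In a notebook, plot_confirmed(totals).show() would render the chart.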

Total Confirmed Coronavirus Cases (Globally)

The sharp exponential curve that can be seen on the right side of the graph shows the devastating rate at which the pandemic is spreading worldwide. Before further drill-down, I looked at the progression of recovered, death and active cases as well. 


Between 3/1 and 3/13, there were more Recovered cases than Active cases but after 3/14, the trend reversed and the gap between the number of Recovered cases and Active cases started increasing very rapidly. 

Drilling down to the country level, the current situation for the top 20 countries (in terms of confirmed cases) is shown below. Also, notice the neat “bars” in each cell!
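A ranking like this can be sketched with a sort and a head. The helper name top_countries and the toy frame are assumptions for illustration; the in-cell bars are available through the pandas Styler, shown as a comment because it only renders inside a notebook.

```python
import pandas as pd


def top_countries(cases_df, n=20, by="Confirmed"):
    """Return the n rows with the largest value in `by`, re-indexed from 0."""
    return cases_df.sort_values(by, ascending=False).head(n).reset_index(drop=True)


# Toy frame with made-up values.
toy = pd.DataFrame(
    {
        "Country_Region": ["A", "B", "C"],
        "Confirmed": [10, 300, 25],
        "Deaths": [1, 9, 0],
    }
)
top = top_countries(toy, n=2)
print(top)

# In a notebook, bars inside each cell can be drawn with:
# top.style.bar(subset=["Confirmed", "Deaths"])
```

Because the index is reset to start at 0, a country's position in the frame is its rank minus one.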


India right now is at #41 (the data is 0 indexed)


As the table above shows, China is far ahead in the “Recovery” count, while the US leads in terms of testing/confirmation of cases. Italy has the highest number of deaths as well as the highest number of active cases.

Data Modelling & Analysing Coronavirus: The INDIA Focus

In this section, I will focus on the data points for India. For this, the India rows need to be filtered out of each data-frame conditionally. This can be done as follows:
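A minimal sketch of such a conditional filter is given here. The helper name filter_country and the default column name are assumptions; the column is left as a parameter in case the aggregated file names it differently from the time-series files.

```python
import pandas as pd


def filter_country(df, country, column="Country/Region"):
    """Keep only the rows whose country column equals `country`."""
    return df[df[column] == country].reset_index(drop=True)


# Toy frame with made-up values.
toy = pd.DataFrame(
    {
        "Country/Region": ["India", "Italy", "India"],
        "Province/State": [None, None, "X"],
        "1/22/20": [0, 0, 0],
        "1/23/20": [1, 2, 0],
    }
)
india_confirmed = filter_country(toy, "India")
print(india_confirmed)
```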


So, let’s have a look at how this virus has spread across India so far by plotting the four India specific time-series and annotating those with the events manually. 


One can easily see that there was hardly any spread in India (recorded officially) till 03/02/2020. And all 3 confirmed cases had also recovered by then. Let’s focus more on the time interval post 03/02/2020 and try to overlay this time duration with the National Govt.’s response. 

Post 15th March, the Indian Govt has taken two major steps of “Closing all international land borders” and “Nation-wide Lockdown”. How significant these steps turn out to be, cannot be ascertained from the data, at least right now. 

Let’s further analyse and see how the virus transmission happened across India. The India state-level data is not present in the dataset that I have used in the analysis so far. But here is a GIF which shows the spread in an animation. Also shown alongside is a static heatmap of case counts in India. One can see that the most impacted states are Maharashtra and Kerala.


A more scientific way to look at the India data or even the global data would be to look at it on a Semi-Log scale. This is how the visualization would be on a semi-log scale. This can be achieved by a small change in the y-axis setting (type = “log”). 
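The axis change can be sketched as follows. The helper names are hypothetical, and plotly is imported lazily inside make_figure. The last lines illustrate why the log scale helps: a series that doubles at every step has equal increments after taking log10, so it plots as a straight line.

```python
import math


def log_axis_layout(title):
    """Layout for a semi-log chart: only the y-axis type differs from a linear plot."""
    return {
        "title": title,
        "xaxis": {"title": "Date"},
        "yaxis": {"title": "Cases (log scale)", "type": "log"},
    }


def make_figure(dates, counts, title):
    """Build the semi-log figure with plotly.graph_objects."""
    import plotly.graph_objects as go  # imported lazily; only needed for drawing

    fig = go.Figure(go.Scatter(x=dates, y=counts, mode="lines+markers"))
    fig.update_layout(**log_axis_layout(title))
    return fig


# Toy doubling series: the log10 increments are all equal.
counts = [1, 2, 4, 8, 16, 32]
steps = [math.log10(b) - math.log10(a) for a, b in zip(counts, counts[1:])]
print(steps)
```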


Sharp rises (and falls, if any) are easy to spot in this kind of visualization. It is useful for data with exponential relationships, or where one variable covers a large range of values. In our scenario, the case counts are increasing exponentially. For India the counts are still manageable right now, but going forward this plot will be much more intuitive and useful.

With this, we come to an end of the exploratory analysis of the existing data. In the next section, I will use this dataset for data modelling and prediction of the spread of the disease.

Data Modelling and Prediction

Just because the rise in the number of cases is exponential, it does not follow that we can fit the data to an exponential curve and predict the number of cases in the coming days. Compartmental models are normally used to model infectious diseases, and the same can be used for COVID-19 too. The simplest compartmental model is the SIR model. The following excerpt from this source link describes the model and its basic blocks.

The model consists of three compartments: S for the number of susceptible, I for the number of infectious, and R for the number of recovered or deceased (or immune) individuals. This model is reasonably predictive for infectious diseases which are transmitted from human to human, and where recovery confers lasting resistance, such as measles, mumps and rubella.

Each member of the population typically progresses from susceptible to infectious to recovered. This can be shown as a flow diagram in which the boxes represent the different compartments and the arrows the transition between compartments, i.e.

SIR Model

In the multiple models developed for COVID-19 (diffusion medium: airborne droplet), experts and researchers try to estimate the right set of parameters for the region/country. As per the CDC and WHO, the R0 for COVID-19 is definitely above 2. Some sources say it is between 3 and 5.

In the model, the value R0 is an estimate of the number of people an average infected person will spread the disease to. If R0 is greater than 1, the disease probably continues to spread, and if it is less than 1, the disease slowly dies down. Since COVID-19’s R0 is above 2, an average infected person spreads it to 2 or more people, who again spread it to 2 or more people, and that is how this infection continues to spread across the globe. There are other parameters in the model, like β (the transmission rate) and γ (the recovery rate), which need to be estimated. You can read more about the model params and related differential equations here.
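For reference, the textbook form of the SIR equations is sketched below. It assumes a fixed population N = S + I + R, with β as the transmission rate and γ as the recovery rate; this notation is an assumption and may differ from the linked source.

```latex
\begin{aligned}
\frac{dS}{dt} &= -\frac{\beta S I}{N} \\
\frac{dI}{dt} &= \frac{\beta S I}{N} - \gamma I \\
\frac{dR}{dt} &= \gamma I \\
R_0 &= \frac{\beta}{\gamma}
\end{aligned}
```

Early in an outbreak S is close to N, so dI/dt is approximately (β - γ)I. The infected count therefore grows when β/γ is above 1 and shrinks when it is below 1, which is the threshold behaviour described above.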

As a matter of fact, there is a well-documented example in the scipy package on the SIR model. Check out this link for more clarity on the calculation of these parameters. I also came across a blog, “COVID-19 dynamics with SIR model”, on how to estimate these parameters from the available COVID-19 data. It turns out that the differential equations can be solved easily using the “solve_ivp” function in the scipy module, and the parameters of the model can then be tuned against the data.

The predict and train functions are defined as follows:
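A minimal sketch of what such functions could look like is given here; it is not the code from the reference blog. It assumes the normalised SIR form (infections proportional to S·I/N) and uses scipy.optimize.minimize with L-BFGS-B as the tuner, which is an assumption about the fitting method. The names sir_rhs, predict and train, the starting point and the bounds are all illustrative.

```python
import numpy as np
from scipy.integrate import solve_ivp
from scipy.optimize import minimize


def sir_rhs(t, y, beta, gamma):
    """Right-hand side of the SIR equations for y = [S, I, R]."""
    s, i, r = y
    n = s + i + r
    new_infections = beta * s * i / n
    return [-new_infections, new_infections - gamma * i, gamma * i]


def predict(beta, gamma, s0, i0, r0, days):
    """Integrate the model forward and return the day grid and S, I, R arrays."""
    t_eval = np.arange(days + 1)
    sol = solve_ivp(sir_rhs, (0, days), [s0, i0, r0], t_eval=t_eval, args=(beta, gamma))
    return sol.t, sol.y


def train(loss, data_args, start=(0.2, 0.1), bounds=((1e-8, 2.0), (1e-8, 2.0))):
    """Find the (beta, gamma) pair that minimises loss(point, *data_args)."""
    result = minimize(loss, start, args=data_args, method="L-BFGS-B", bounds=bounds)
    return result.x


# Illustrative parameter values, not fitted to any real data.
t, (s, i, r) = predict(beta=0.3, gamma=0.1, s0=99_997, i0=3, r0=0, days=200)
print("peak infected:", int(i.max()), "on day", int(t[i.argmax()]))
```

Keeping the loss as an argument of train lets the same optimiser be reused with different error measures.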


The loss function is defined as follows:
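One plausible shape for such a loss is a weighted root-mean-square error between the observed and simulated series, sketched below under stated assumptions: the normalised SIR form, an equal default weighting (alpha=0.5) between the infected and recovered errors, and hypothetical helper names. The check at the end uses synthetic data generated by the same model, so the loss is zero at the true parameters.

```python
import numpy as np
from scipy.integrate import solve_ivp


def simulate(beta, gamma, s0, i0, r0, days):
    """Return daily S, I, R arrays for `days` days, starting from day 0."""
    n = s0 + i0 + r0

    def rhs(t, y):
        s, i, _ = y
        flow = beta * s * i / n
        return [-flow, flow - gamma * i, gamma * i]

    sol = solve_ivp(rhs, (0, days - 1), [s0, i0, r0], t_eval=np.arange(days))
    return sol.y


def loss(point, infected, recovered, s0, i0, r0, alpha=0.5):
    """Weighted RMSE between observed and simulated infected/recovered series."""
    beta, gamma = point
    _, infected_hat, recovered_hat = simulate(beta, gamma, s0, i0, r0, len(infected))
    rmse_infected = np.sqrt(np.mean((infected_hat - infected) ** 2))
    rmse_recovered = np.sqrt(np.mean((recovered_hat - recovered) ** 2))
    return alpha * rmse_infected + (1 - alpha) * rmse_recovered


# Synthetic "observations" from known parameters (toy values, not real data).
_, infected, recovered = simulate(0.3, 0.1, 997, 3, 0, 30)
exact = loss((0.3, 0.1), infected, recovered, 997, 3, 0)
worse = loss((0.2, 0.1), infected, recovered, 997, 3, 0)
print(exact, worse)
```

The signature loss(point, *data) is the form scipy.optimize.minimize expects when the observed series are passed through its args parameter.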


Simulations with Actual Data for Italy & India

After modifying the code from the reference blog a bit, I was able to run the simulation for SIR model for Italy and India. For Italy, I ran the code on defaults and for India, I tried various combinations of parameters. Here are some of the results of the simulation. 



It can be observed that the model looks like a good approximation.

  • Infected data & Infected curve are close.
  • Recovered data & Recovered curve are also close.
  • Learnt values: country=Italy, beta=0.00000233, gamma=0.01875791, r_0:0.00012435

*results are for N=100,000

For India, for the Sake of Data Modelling ☺

An initial population size of 1,00,000 is too small an estimate for a country like India, since it assumes that less than 0.01% of the population is at risk. This is by and large a big underestimate. Still, it is a good point to begin the data modelling. So, running the model with S_0: 1,00,000, Start-Date: 1/30/20, I_0: 3 and a prediction window of 200 days, the following is the output:


Because the confirmed cases themselves have been too low, the plot lines for actuals are not that clear/distinct. And the “Infected” trace does not even come down below the 60,000 mark even after August. Well, usually predictions based on 1 month of data won’t be that accurate if one tries to predict for 7 months. Let’s have a closer look at the next 7-14 days in this data. Replotting the data only till 15th April gives the following output. For this step, the simulation run earlier to generate the plots also generates a csv with the data-dump of the findings. I used that to generate the following plot. 
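The windowing step could be sketched as below. The layout of the CSV dump is not shown in this post, so the column names (Date, Susceptible, Infected, Recovered) and the helper name window_until are assumptions; the drop parameter lets a very large series be left out so the smaller ones stay readable.

```python
import io

import pandas as pd


def window_until(df, end_date, date_column="Date", drop=("Susceptible",)):
    """Keep rows up to end_date and drop series that would dwarf the others."""
    out = df.copy()
    out[date_column] = pd.to_datetime(out[date_column])
    out = out[out[date_column] <= pd.Timestamp(end_date)]
    return out.drop(columns=[c for c in drop if c in out.columns])


# Toy dump with made-up values, standing in for the simulation output.
dump = io.StringIO(
    "Date,Susceptible,Infected,Recovered\n"
    "2020-04-14,99000,600,400\n"
    "2020-04-15,98900,650,450\n"
    "2020-04-16,98800,700,500\n"
)
window = window_until(pd.read_csv(dump), "2020-04-15")
print(window)
```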


Note that the “Susceptible” series had to be left out to get a clearer plot. Zooming in on the most recent week, we see that the predicted “Infected” curve is already behind the “Infected Data” curve. So, it is indeed an under-estimate. Even with this under-estimate, by 15th April, India will be well above the 3500 mark.

The same plot including the “Susceptible” series can be viewed on a Semi-Log plot as shown below:


Even with this big under-estimate, India looks likely to cross the 2000 mark by 4/7 and 3500 mark by 4/13. 

Once again, I would like to call out that these illustrations are merely an approach to data modelling and analysis of the spread. These numbers should not be considered an accurate prediction of the spread for the next two weeks or two hundred days. That is best left to the expert epidemiologists and researchers in the field.

Have a look at this awesome Epidemic Calculator, which is an elaboration of the SEIR model for COVID-19. It lets you play around with the parameters and observe the differences in the curve.


So, What Can We Do?

Since there is no vaccine available right now, the only way to handle the spread is to slow down the transmission. As can be seen even in the under-estimates, and from the actual data around us, the sharply increasing number of cases is bound to overwhelm the medical infrastructure of any nation. By slowing down the transmission, we don’t actually stop the spread, but we keep the transmission and the number of active cases at any point in time well within the limits of the medical handling capacity. This is what is being referred to as “Flattening The Curve”.

Here is a GIF that shows the impact of “Flattening The Curve”.

Source: World Economic Forum (Flattening The Curve)

But How do we Flatten the Curve?

Since the virus is being spread from one human to another, experts suggest three things that can help flatten the curve:

1. Travel Restrictions

It is quite obvious that by restricting people from travelling in or out of a particular region, the transmission could be reduced. The question is by how much, and for how long, the transmission is reduced. Here is an image which shows how much delay in spread a travel restriction could have caused in China (excluding Wuhan).


The difference isn’t noticeable because a travel ban alone would have caused a delay of only days, and people would have kept on infecting each other within the region.

2. Social Distancing 

This is what reduces transmission significantly. As we saw in the SIR modelling, COVID-19 has a high R0: each infected person ends up infecting 2-3 people, and so on. So, maintaining social distance during these times will definitely help reduce transmission from the infected to others. Here is a very simple GIF that illustrates the impact of social distancing.


3. More Testing

This is to quickly identify and isolate the infected from the non-infected. Given that COVID-19 has a long incubation period (symptoms start appearing after 5-7 days), a person may not even realize that he/she is infected and, in the meantime, may spread the infection too. To be able to do this, extensive testing is required. Doctors and medical staff need to be provided with safety equipment. Laboratories need to procure testing kits. Hospitals need to have ICUs and quarantine units in large numbers. Most of these are infrastructure problems at national and international levels. Hence the need to “Flatten the Curve”.

The combined impact of Transmission reduction and Travel restriction can be seen in the illustration below: 


As you can see in the illustration, if transmission rate of COVID-19 went down by even 25%, it could delay the peak by almost 14 weeks. Further reduction would delay it even more. 
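The qualitative effect can be reproduced with a small SIR simulation, sketched below under stated assumptions. The parameter values (β = 0.30, γ = 0.10, a population of 100,000 and 3 initial infections) are illustrative only and are not fitted to any country, so the size of the delay will not match the illustration. The sketch only shows that a 25% lower transmission rate gives a later and lower peak.

```python
import numpy as np
from scipy.integrate import solve_ivp


def infected_curve(beta, gamma, population, initial_infected, days):
    """Daily number of infected people under a normalised SIR model."""

    def rhs(t, y):
        s, i, _ = y
        flow = beta * s * i / population
        return [-flow, flow - gamma * i, gamma * i]

    start = [population - initial_infected, initial_infected, 0]
    sol = solve_ivp(rhs, (0, days), start, t_eval=np.arange(days + 1), max_step=1.0)
    return sol.y[1]


# Illustrative values only (assumptions, not estimates for COVID-19).
baseline = infected_curve(0.30, 0.10, 100_000, 3, 365)
reduced = infected_curve(0.30 * 0.75, 0.10, 100_000, 3, 365)

for label, curve in (("baseline", baseline), ("25% lower transmission", reduced)):
    print(label, "| peak day:", int(curve.argmax()), "| peak infected:", int(curve.max()))
```

The lower peak is what keeps active cases within medical capacity, and the later peak is the time gained to prepare.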

Last but not least, Do Not Panic. Not all the information that appears on social media is true. Do not self-medicate, and report to doctors if you observe any symptoms of the disease.

By the time I finished writing this blog, the numbers had changed. Luckily, I could refresh my analysis code and here is where we stand. 

Case Count Summary Globally as on 29th March 2020

Readers are requested to provide feedback and comments, and to point out any errors in the analysis. If you want something added to the analysis, feel free to post that too. I will shortly publish a cleaned-up Jupyter Notebook with the complete code for all the analysis and data modelling done in this blog. Upskill yourself with online courses in data science and related areas. Consider Springboard’s 1:1 mentoring-led, project-driven online learning program in data science and data analytics. We not only help you learn data science but also coach and mentor you to get the best jobs and command a top salary!



Apart from the datasets mentioned above, there is a much bigger dataset: the COVID-19 Open Research Dataset (CORD-19). This dataset is the base of a few Kaggle competitions announced a few days ago.
