Data visualisation is one of the most important tools in a data scientist’s toolbox for presenting ideas discovered from data. It lets us illustrate complex information succinctly, even to people without technical knowledge. For data that varies by location, one of the best techniques is to plot it on a map. In this blog, we will go through the steps involved in building an interactive time series map of the spread of Coronavirus (COVID-19). To do this, we will use GeoPandas to plot choropleth maps from geospatial data, and ipywidgets to generate an interactive time series slider. By the end of this blog, you should be able to generate a plot similar to the one shown below.

Time Series Analysis of COVID-19 Spread-1

There are many official and unofficial data sources on the web providing COVID-19 related data. One of the most widely used datasets is the one provided by the Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE). Here is the GitHub link for it: Novel Coronavirus (COVID-19) Cases, provided by JHU CSSE. I have used the global confirmed cases time series data for this blog.

Time Series Analysis: Importing Data

COVID-19 Spread
Image created using Canva

Importing the global confirmed cases into a pandas DataFrame directly from the link is more convenient than downloading the file to the local system and loading it from there, because the notebook then picks up whatever version of the file is currently in the GitHub repository.

import pandas as pd
confirmed_link = "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv"
confirmed = pd.read_csv(confirmed_link)

The next step is to do some initial analysis to understand the shape and type of the data that we have imported.

confirmed.columns
Index(['Province/State', 'Country/Region', 'Lat', 'Long', '1/22/20', '1/23/20',
      '1/24/20', '1/25/20', '1/26/20', '1/27/20',
      ...
      '6/22/20', '6/23/20', '6/24/20', '6/25/20', '6/26/20', '6/27/20',
      '6/28/20', '6/29/20', '6/30/20', '7/1/20'],
      dtype='object', length=166)
confirmed.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 266 entries, 0 to 265
Columns: 166 entries, Province/State to 7/1/20
dtypes: float64(2), int64(162), object(2)
memory usage: 345.1+ KB
confirmed.head()

Time Series Analysis: Data Wrangling

The data is stored with each day as a separate column. We need it reshaped so that each day becomes a separate row (so that the date can be passed as an argument when generating the time series maps later on). The data frame is reshaped using the pandas melt function for this.

Some countries have the Province/State column populated and some don’t, but we need the confirmed cases aggregated per country. So we drop the Province/State column and group by the Country/Region and Date columns to get the sum of confirmed cases for each country on each day. The Lat and Long columns are also not necessary for our purpose and are dropped.

# move the date columns into rows using the dataframe's melt function
confirmed = confirmed.melt(id_vars=['Province/State', 'Country/Region', 'Lat', 'Long'], var_name="Date", value_name="Confirmed")
# drop unnecessary columns
confirmed = confirmed.drop(['Province/State', 'Lat', 'Long'], axis=1)
# some countries have multiple rows because the province data was populated,
# so group by country and date after the province column is dropped
confirmed = confirmed.groupby(['Country/Region', 'Date']).sum()
confirmed = confirmed.reset_index()
# update the Country column to match the world dataframe
# (country_map is the name-mapping dictionary defined in the static choropleth section)
confirmed['Country'] = confirmed['Country/Region'].map(country_map)
confirmed.loc[confirmed['Country'].isnull(), 'Country'] = confirmed.loc[confirmed['Country'].isnull(), 'Country/Region']
confirmed
Area-wise COVID-19 spread

Now, the dataframe is in the format that we want to generate the required maps.
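To see what the melt and group-by steps do in isolation, here is a minimal sketch on a toy table. The three rows and their counts are made up for illustration and are not taken from the JHU file; only the column names follow the real data.

```python
import pandas as pd

# Toy wide-format table: one column per day (values are made up)
wide = pd.DataFrame({
    "Province/State": ["Hubei", "Beijing", None],
    "Country/Region": ["China", "China", "Italy"],
    "Lat": [30.97, 40.18, 41.87],
    "Long": [112.27, 116.41, 12.57],
    "1/22/20": [10, 5, 0],
    "1/23/20": [12, 7, 1],
})

# Wide -> long: every date column becomes a (Date, Confirmed) pair of values
long = wide.melt(
    id_vars=["Province/State", "Country/Region", "Lat", "Long"],
    var_name="Date",
    value_name="Confirmed",
)

# Drop the columns we do not need, then sum the provinces of each country per day
long = long.drop(columns=["Province/State", "Lat", "Long"])
per_country = long.groupby(["Country/Region", "Date"], as_index=False)["Confirmed"].sum()

print(per_country)
```

The two Chinese provinces collapse into a single row per day, which is the shape the plotting code later filters on.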

Time Series Analysis with GeoPandas

GeoPandas is an open source project that makes working with geospatial data in Python easier. It extends the datatypes used by pandas to allow spatial operations on geometric types. A GeoPandas dataframe has an additional geometry column that describes the shape of each location, which can then be plotted easily with its matplotlib-based plot function.

GeoPandas combines the capabilities of pandas and shapely: it provides geospatial operations in pandas and a high-level interface to multiple shapely geometries, and it can draw the data on a map based on the coordinates in the geometry column. This lets you do operations in Python that would otherwise require a spatial database such as PostGIS.

GeoJSON is a format for encoding a variety of geographic data structures. It supports several geometry types (Point, LineString, Polygon, MultiPoint, MultiLineString, and MultiPolygon). A GeoJSON file can be loaded directly into a GeoPandas data frame, and the coordinates contained in the file become the geometry column of that data frame.
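As a rough illustration of what such a file contains, the sketch below parses a tiny hand-written GeoJSON document with Python's standard json module. The feature name "Exampleland" and its coordinates are invented for this example; in GeoJSON each position is written as [longitude, latitude], and a polygon ring repeats its first point at the end to close the shape.

```python
import json

# A minimal, hand-written GeoJSON document (hypothetical feature, made-up coordinates)
geojson_text = """
{
  "type": "FeatureCollection",
  "features": [
    {
      "type": "Feature",
      "properties": {"name": "Exampleland"},
      "geometry": {
        "type": "Polygon",
        "coordinates": [[[0.0, 0.0], [4.0, 0.0], [4.0, 3.0], [0.0, 3.0], [0.0, 0.0]]]
      }
    }
  ]
}
"""

collection = json.loads(geojson_text)

for feature in collection["features"]:
    geometry = feature["geometry"]
    # The first ring of a Polygon is its outer boundary
    outer_ring = geometry["coordinates"][0]
    print(feature["properties"]["name"], geometry["type"], len(outer_ring))
```

Saved to disk, a document like this is the kind of file that gpd.read_file can load, with the properties becoming ordinary columns and the geometry object becoming the geometry column.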

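For our purpose we need the boundaries of all the countries. GeoPandas has shipped a small Natural Earth dataset (naturalearth_lowres) that can be used directly, so we don’t have to import a separate GeoJSON file. Note that the gpd.datasets module was deprecated in GeoPandas 0.14 and removed in 1.0, so the call below needs an older GeoPandas version. With newer versions, download the Natural Earth countries file and pass its path to gpd.read_file instead, bearing in mind that its column names may differ from the ones used here.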

import geopandas as gpd
world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))
world
Time Series Analysis of COVID-19 Spread-2

Generating Static Choropleth Maps

Choropleth maps are maps where the colour of each shape is based on the value of an associated variable. In our case, the confirmed number of COVID-19 cases will be the variable on which the choropleth will be generated. 
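The idea of "colour based on value" can be sketched in plain Python as below. This is a simplified, classed version with hypothetical break points, colours, and country counts chosen only for illustration; GeoPandas' plot function with a cmap maps values to a continuous colour scale instead.

```python
def colour_for(value, breaks, palette):
    """Return the colour of the first class whose upper bound is >= value."""
    for upper, colour in zip(breaks, palette):
        if value <= upper:
            return colour
    # Anything above the last break falls into the final class
    return palette[-1]

# Hypothetical class boundaries and a four-colour light-to-dark palette
breaks = [100, 10_000, 1_000_000]
palette = ["#edf8fb", "#b3cde3", "#8c96c6", "#88419d"]

# Made-up confirmed counts for three fictional countries
counts = {"A": 50, "B": 5_000, "C": 2_500_000}
colours = {name: colour_for(n, breaks, palette) for name, n in counts.items()}
print(colours)
```

Each country's polygon would then be filled with its assigned colour, so darker shapes indicate more confirmed cases.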

Next, we need to merge the world dataframe and the confirmed dataframe. Since the two dataframes come from different sources, there are some differences, especially in the country name column that we will use for the merge. Some country names do not match between the two, so we have to update the values in the confirmed data frame before merging.

country_map = {'Bosnia and Herzegovina': 'Bosnia and Herz.',
               'Central African Republic': 'Central African Rep.',
               "Cote d'Ivoire": "Côte d'Ivoire",
               'Dominican Republic': 'Dominican Rep.',
               'Equatorial Guinea': 'Eq. Guinea',
               'Eswatini': 'eSwatini',
               'South Sudan': 'S. Sudan',
               'Taiwan*': 'Taiwan',
               'US': 'United States of America',
               'Western Sahara': 'W. Sahara'}
# apply the mapping; names without an entry keep their original value
confirmed['Country'] = confirmed['Country/Region'].map(country_map)
confirmed.loc[confirmed['Country'].isnull(), 'Country'] = confirmed.loc[confirmed['Country'].isnull(), 'Country/Region']
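The names that need an entry in a mapping like this can be found programmatically rather than by eye. The following is a minimal sketch using set difference; the helper name unmatched_names and the short sample lists are hypothetical and stand in for the real columns.

```python
def unmatched_names(source_names, target_names):
    """Return the source names that have no exact match among the target names."""
    return sorted(set(source_names) - set(target_names))

# Small samples standing in for the two name columns (illustrative only)
jhu_names = ["US", "Italy", "Taiwan*", "South Sudan"]
world_names = ["United States of America", "Italy", "Taiwan", "S. Sudan"]

print(unmatched_names(jhu_names, world_names))

# After applying a mapping, the list of unmatched names should shrink
sample_map = {"US": "United States of America",
              "Taiwan*": "Taiwan",
              "South Sudan": "S. Sudan"}
mapped = [sample_map.get(name, name) for name in jhu_names]
print(unmatched_names(mapped, world_names))
```

With the real data, the same helper could be called with confirmed['Country/Region'].unique() and world['name']. Some entries may legitimately have no counterpart in the world dataframe, so an empty result is not guaranteed.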

Since we cannot plot time series choropleths in GeoPandas directly, we will first copy a single day into a temporary dataframe:

confirmed20 = confirmed[confirmed['Date'] == '6/20/20']

A merged data frame is then generated from confirmed20 and the world GeoPandas dataframe:

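cworld = world.merge(confirmed20, how='left', left_on='name', right_on='Country')
# exclude Antarctica from the map
cworld = cworld[cworld['name'] != "Antarctica"].copy()
# countries with no matching row in the confirmed data get 0 cases
cworld['Confirmed'] = cworld['Confirmed'].fillna(0)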

Finally, generate the choropleth map:

import matplotlib.pyplot as plt
# generate the choropleth map using the GeoDataFrame plot function on the Confirmed column
ax = cworld.plot(column='Confirmed', cmap='cool', figsize=(18, 10), legend=True,
                 legend_kwds={'label': "No. of Confirmed COVID-19 Cases",
                              'orientation': "horizontal"})
# remove the axis ticks
plt.axis('off')
# add the title
plt.title("Confirmed COVID-19 cases per Country")
plt.show()
Time Series Maps

Time Series Analysis: Interactive Time Series Widget

Now that we are able to generate a static choropleth map for one day, we’ll see how to use IPython widgets (ipywidgets) to build interactive controls with one line of code. This library allows us to turn Jupyter Notebooks from static documents into interactive dashboards, which suits exploring and visualising data.

In the last section we generated a choropleth map from a GeoPandas dataframe. Now we will build on this and create a function that takes a date as input and generates the same plot. This function is then passed to the ipywidgets function “interact”, which calls it with the date range selected on a slider to produce an interactive time series plot.

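from ipywidgets import interact
# build one merged dataframe per date and stack them together
# (DataFrame.append was removed in pandas 2.0, so pd.concat is used instead)
merged_frames = []
for i in confirmed.Date.unique():
    mergetemp = world.merge(confirmed[confirmed.Date==i], left_on='name', right_on='Country', how='left')
    merged_frames.append(mergetemp)
mergedworld = pd.concat(merged_frames)
def worldplot(date):
    # date is assumed to be the (start, end) tuple from the range slider; plot the start date
    mergedworld[mergedworld.Date==date[0]].plot(column='Confirmed', figsize=(20,9), legend=True)

# selection_range_slider is a date range slider widget; its definition is not shown in this post
interact(worldplot, date=selection_range_slider)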
Time Series Maps-2
Time Series Maps-3
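The selection_range_slider passed to interact is not defined anywhere in this post. One way it might be built is sketched below. This is an assumption, not the original definition, and the helper names build_date_options and make_range_slider are hypothetical. The date labels are sorted chronologically first, because strings such as '10/3/20' and '2/1/20' do not sort in date order as plain text.

```python
from datetime import datetime

def build_date_options(labels):
    """Sort m/d/yy date strings chronologically while keeping the original strings."""
    return sorted(labels, key=lambda s: datetime.strptime(s, "%m/%d/%y"))

def make_range_slider(options):
    """Create a range slider over the given options (requires ipywidgets)."""
    # Imported here so the sorting helper can be used without ipywidgets installed
    import ipywidgets as widgets
    return widgets.SelectionRangeSlider(
        options=options,
        index=(0, len(options) - 1),  # start with the full range selected
        description="Dates",
    )

# Illustrative labels in the same format as the Date column
options = build_date_options(["2/1/20", "1/22/20", "10/3/20", "2/10/20"])
print(options)
```

In the notebook, something like selection_range_slider = make_range_slider(build_date_options(confirmed.Date.unique())) would then give interact a slider whose value is a (start, end) tuple of date strings, which matches the date[0] lookup in worldplot.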

You can see how much more effective and engaging a time series choropleth map is compared to a static one. Here, we’re looking at the total number of confirmed cases of the coronavirus by country over time. After looking at the map for just a few seconds, we can see that China initially led the number of cases, after which the US took over and has led ever since, while China’s confirmed cases have stayed almost constant and other countries have overtaken it in recent months.

Further Reading

After this initial analysis, you can learn about prediction and data modelling on the same Coronavirus data. This blog provides more details: Data Modelling & Analysing Coronavirus (COVID19) Spread using Data Science & Data Analytics in Python Code

For details on each of the visualisation steps provided in this blog and much more, refer to the data visualisation section of the data science career track with Springboard.